Abstract

We address the problem of answering queries over a distributed information system, storing objects
indexed by terms organized in a taxonomy. The taxonomy consists of subsumption relationships between
negation-free DNF formulas on terms and negation-free conjunctions of terms. In the first part of the
paper, we consider the centralized case, deriving a hypergraph-based algorithm that is efficient in data
complexity. In the second part of the paper, we consider the distributed case, presenting alternative ways
of implementing the centralized algorithm. These ways descend from two basic criteria: direct vs. query
re-writing evaluation, and centralized vs. distributed data or taxonomy allocation. Combinations of
these criteria cover a wide spectrum of architectures, ranging from client-server to peer-to-peer.
We evaluate the performance of the various architectures by simulation on a network with $O(10^4)$ nodes,
and derive final results. An extensive review of the relevant literature is finally included.

1 Introduction 

Consider an information source $S$ structured as a tetrad $S = (T, \preceq, Obj, I)$, where $T$ is a set of terms,
$\preceq$ is a taxonomy over concepts expressed using $T$ (e.g. (Animal $\land$ FlyingObject) $\lor$ Penguin $\preceq$ Bird),
$Obj$ is a set of objects and $I$ is the interpretation, that is a function from $T$ to $\mathcal{P}(Obj)$, assigning an
extension (i.e., a set of objects) to each term. Now assume that there is a set $\mathcal{N}$ of such sources
$\mathcal{N} = \{S_1, \ldots, S_n\}$, all sharing the same set of objects $Obj$ and related by taxonomic relationships amongst
concepts of different sources. These relationships are called articulations and aim at bridging the inevitable
naming, granularity and contextual heterogeneities that may exist between the taxonomies of the sources (for some
examples see [36]). For example, the taxonomy of a source $S_1$ could be the following: { Penguin $\preceq$ Animal,
Pelican $\preceq$ Animal, Ostrich $\preceq$ Animal, (Animal $\land$ FlyingObject) $\lor$ Penguin $\lor$ Ostrich $\preceq$ Bird }.
$S_1$ could have an articulation to a source $S_2$ like { Πιγκουίνος $\preceq$ Penguin, Πελεκάνος $\preceq$ Pelican }, an
articulation to a source $S_3$ like { Animale $\land$ Alato $\preceq$ Birds }, and an articulation to two sources $S_4, S_5$
of the form: { (Fliegentier) $\lor$ (Animal $\land$ Volant) $\preceq$ (Animal $\land$ FlyingObject) }.

Networks of sources of this kind are nowadays commonplace. For instance, the objects may be web pages,
and a source may be a portal serving a specific community endowed with a vocabulary used for indexing
web pages. The objects may be library resources such as books, serials, or reports, and a source may be
a library describing the content of the resources according to a local vocabulary. The objects may be a
category of commercial items, such as cars, and a source may be an e-commerce site which sells the items.
And so on. Articulations may be drawn from language dictionaries, or may be the result of cooperation
agreements, such as in the case of sources belonging to the same organization. In certain cases, articulations
can be constructed automatically, for instance using the data-driven method proposed in [34]. Moreover,
sources and articulations expressed in a syntactically richer language, such as a Semantic Web language, are
typically mapped down to the sources and articulations we assume, for computational reasons [7].

In this paper we address the problem of answering Boolean queries over networks of this kind of sources. 
The work is carried out in three stages. 

First, the theoretical aspects of query evaluation against a source are analyzed, and an algorithm is 
derived which extends a hypergraph-based method for satisfiability of propositional Horn clauses. The 
algorithm is conceptually very simple and has polynomial time complexity with respect to the size of Obj. 
This is in fact the theoretical lower bound. 

Secondly, we derive different implementations of the query evaluation algorithm, all of them exploiting 
the use of a cache for optimization reasons. The different implementations stem from the considerations of 
two orthogonal criteria: evaluation mode and data allocation. The first criterion leads to two alternative 
approaches: direct query evaluation vs. query re-writing. The second criterion leads to four possibilities, 
corresponding to the centralized vs. the distributed allocation of the taxonomy or of the interpretation. 
When considered in combination, these two criteria give rise to five interesting distributed architectures,
ranging from the well-known client/server architecture to the recently proposed peer-to-peer (P2P) systems.

In considering the data allocation criterion, we have made the assumption that the sources are willing 
to make (some of) their data available for storing in a centralized way. This may not always be the case. If 
it is not the case, then only one implementation is possible, namely the one based on the P2P architecture, 
which reflects the model of a source at the physical level. 

Thirdly, the derived implementations are evaluated from a performance point of view, in terms of response 
time. The performance evaluation has been carried out by simulating the implementations on a network 
of 11400 sources. The network has been configured based on the parameters derived in a study on the 
Gnutella network. The results of the simulations show that the client-server implementation is, perhaps 
not surprisingly, the one offering the shortest response time. Amongst the other architectures, the best 
performance is attained, perhaps surprisingly, by centralizing the taxonomy while keeping the interpretations 
distributed and executing query evaluation in two stages. This is due to the fact that re-writing the query 
avoids multiple accesses to the same source, but this gain can be appreciated only if the taxonomy is 
centralized, so that a single access (to the taxonomy server) is necessary to perform the query re-writing. 

In sum, the three main results of the paper are: (i) a query evaluation procedure for a source; (ii) 
five different optimized implementations of the query evaluation procedure, corresponding to the considered 
architectures; all implementations are optimized, in that they make use of a cache; and (iii) a ranking of 
these algorithms, based on their performance. 

The paper is structured as follows: Section 2 introduces the model of information system studied in the
paper, formulates the query evaluation problem, and derives a sound and complete query evaluation procedure
for the centralized case. Section 3 illustrates a basic method for carrying out query evaluation, putting
the theoretical notions developed in the previous Section into a concrete software perspective. Section 4
discusses possible ways of implementing the basic method in a distributed setting, deriving five significant
architectures. For each architecture, a description of the behaviour of the involved components is provided.
Section 5 presents an evaluation of the performance of the five architectures in terms of response time for
query evaluation. Section 6 compares our work with related work and Section 7 concludes the paper.

Finally, we would like to mention that initial ideas of this work have appeared in our conference paper [26].
This work extends [26] by providing (i) the query evaluation principles in Section 3, (ii) the algorithms and
architectures for network query evaluation in Section 4, and (iii) the performance evaluation in Section 5.

2 Foundations 

This Section defines information sources and the query evaluation problem. The algorithmic foundations of 
this problem are given and an efficient query evaluation method is provided. These results will be applied 
later, upon studying networks of sources. 



2.1 The model

The basic notion of the model is that of terminology: a terminology $T$ is a non-empty set of terms. From a
terminology, queries can be defined.

Definition 1 (Query) The query language associated to a terminology $T$, $\mathcal{L}_T$, is the language defined by
the following grammar, where $t$ is a term of $T$:

$$q ::= d \mid q \lor d$$
$$d ::= t \mid t \land d$$

An instance of $q$ is called a query, while an instance of $d$ is called a conjunctive query and a disjunct of $q$
whenever $d$ occurs in $q$. □

Terms and conjunctive queries can be used for defining taxonomies.

Definition 2 (Taxonomy) A taxonomy over a terminology $T$ is a pair $(T, \preceq)$ where $\preceq$ is any set of pairs
$(q, d)$ where $q$ is any query and $d$ is a conjunctive query. □

For example, if $T = \{a1, a2, b1, b2, b3, c1\}$ then a taxonomy over $T$ could be $(T, \preceq)$ where (using an infix
notation) $\preceq$ is $\{(b1 \land b2) \lor b3 \preceq a1 \land a2,\ a1 \land a2 \preceq c1\}$.

If $(q, q') \in \preceq$, we say that $q$ is subsumed by $q'$ and we write $q \preceq q'$.

Definition 3 (Interpretation) An interpretation for a terminology $T$ is a pair $(Obj, I)$, where $Obj$ is a
finite, non-empty set of objects and $I$ is a total function assigning a possibly empty set of objects to each
term in $T$, i.e. $I : T \rightarrow \mathcal{P}(Obj)$. □

Interpretations are used to define the semantics of the query language:

Definition 4 (Query extension) Given an interpretation $I$ of a terminology $T$ and a query $q \in \mathcal{L}_T$, the
extension of $q$ in $I$, $q^I$, is defined as follows:

$$(q \lor d)^I = q^I \cup d^I$$
$$(t \land d)^I = t^I \cap d^I$$
$$t^I = I(t). \quad \Box$$

Since $\cdot^I$ is an extension of the interpretation function $I$, we will simplify notation and will write $I(q)$ in
place of $q^I$. We can now define an information source (or simply source).

Definition 5 (Information source) An information source $S$ is a 4-tuple $S = (T_S, \preceq_S, Obj_S, I_S)$, where
$(T_S, \preceq_S)$ is a taxonomy and $(Obj_S, I_S)$ is an interpretation for $T_S$. □

When no ambiguity will arise, we will omit the subscript in the components of sources and equate $I$ with
$(Obj, I)$, for simplicity. Moreover, given a source $S = (T, \preceq, Obj, I)$ and an object $o \in Obj$, the index of $o$ in
$S$, $ind_S(o)$, is given by the terms in whose interpretation $o$ belongs, i.e.:

$$ind_S(o) = \{t \in T \mid o \in I(t)\}.$$
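
The definitions so far can be made concrete with a small sketch. It assumes, purely for illustration, that a conjunctive query is encoded as a frozenset of terms and a query as a list of such conjunctions; the function names `extension` and `index` and the sample data are not part of the paper.

```python
from typing import Dict, FrozenSet, List, Set

# Assumed encoding: a conjunctive query is a frozenset of terms,
# a query (in DNF) is a list of conjunctive queries.
Conj = FrozenSet[str]
Query = List[Conj]


def extension(q: Query, I: Dict[str, Set[str]], objects: Set[str]) -> Set[str]:
    """I(q): union over the disjuncts of the intersection of the term extensions."""
    result: Set[str] = set()
    for d in q:
        ext = set(objects)          # start from Obj, then intersect term by term
        for t in d:
            ext &= I.get(t, set())
        result |= ext
    return result


def index(o: str, I: Dict[str, Set[str]]) -> Set[str]:
    """ind_S(o): the terms whose interpretation contains the object o."""
    return {t for t, objs in I.items() if o in objs}


# Hypothetical interpretation over three objects.
objects = {"o1", "o2", "o3"}
I = {"a1": {"o1"}, "a2": {"o1", "o2"}, "b1": {"o3"}, "c1": set()}
q = [frozenset({"a1", "a2"}), frozenset({"b1"})]   # (a1 AND a2) OR b1

print(sorted(extension(q, I, objects)))  # ['o1', 'o3']
print(sorted(index("o1", I)))            # ['a1', 'a2']
```

Note that `extension` only evaluates the stored interpretation; it does not take the taxonomy into account, which is the role of the answer defined next.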

The interpretations that reflect the semantics of subsumption are as customary called models, defined 
next. 

Definition 6 (Models of a source) Given two interpretations $I$, $I'$ of the same terminology $T$,

1. $I$ is a model of the taxonomy $(T, \preceq)$ if $q \preceq q'$ implies $I(q) \subseteq I(q')$;

2. $I$ is smaller than $I'$, $I \leq I'$, if $I(t) \subseteq I'(t)$ for each term $t \in T$;

3. $I$ is a model of a source $S = (T, \preceq, Obj, I')$ if it is a model of $(T, \preceq)$ and $I' \leq I$. □

Based on the notion of model, the answer to a query is finally defined.

Definition 7 (Answer) Given a source $S = (T, \preceq, Obj, I)$ and a query $q \in \mathcal{L}_T$, the answer of $q$ in $S$,
$ans(q, S)$, is given by $ans(q, S) = \{o \in Obj \mid o \in J(q) \text{ for all models } J \text{ of } S\}$. □

Since we are exclusively interested in query evaluation, we can restrict ourselves to simpler notions of
sources and queries, which are equivalent to those defined so far from the answer point of view. To begin
with, we observe that a pair $(q, q')$ in a taxonomy is interpreted (in Definition 6, point 1) as an implication
$q \rightarrow q'$. Now, by a simple truth table argument, it can be easily verified that the propositional formula:

$$(C_1 \lor \ldots \lor C_n) \rightarrow (t_1 \land \ldots \land t_m)$$

where each $C_i$ is any propositional formula, is logically equivalent to the formula:

$$(C_1 \rightarrow t_1) \land (C_1 \rightarrow t_2) \land \ldots \land (C_1 \rightarrow t_m) \land \ldots \land (C_n \rightarrow t_1) \land (C_n \rightarrow t_2) \land \ldots \land (C_n \rightarrow t_m),$$

in that the two formulae have the same models. Based on this equivalence, the simplification of a taxonomy
$(T, \preceq)$ is defined as the taxonomy $(T, \preceq^s)$, where:

$$\preceq^s = \{(C, t) \mid (C_1 \lor \ldots \lor C_n) \preceq (t_1 \land \ldots \land t_m),\ C \in \{C_1, \ldots, C_n\},\ t \in \{t_1, \ldots, t_m\}\}.$$

Correspondingly, the simplification of a source $S = (T, \preceq, Obj, I)$ is defined to be the source
$S^s = (T, \preceq^s, Obj, I)$. It is not difficult to see that:

Proposition 1 $J$ is a model of a source $S$ if and only if it is a model of $S^s$. □

The simplification of the taxonomy in the previous example is given by:

$$\{b1 \land b2 \preceq a1,\ b1 \land b2 \preceq a2,\ b3 \preceq a1,\ b3 \preceq a2,\ a1 \land a2 \preceq c1\}.$$

For simplicity, from now on $\preceq$ and $S$ will stand for $\preceq^s$ and $S^s$, respectively.
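
As a minimal sketch of the simplification step, the snippet below splits every relationship whose left-hand side is a disjunction and whose right-hand side is a conjunction into pairs made of one disjunct and one head term. The encoding (a tuple of frozensets for the left-hand side, a frozenset for the right-hand side) and the name `simplify` are illustrative assumptions.

```python
def simplify(taxonomy):
    """Split each (C1 v ... v Cn, t1 ^ ... ^ tm) into the pairs (Ci, tj)."""
    simplified = set()
    for disjuncts, head in taxonomy:
        for c in disjuncts:        # one conjunctive query per disjunct
            for t in head:         # one term per conjunct of the head
                simplified.add((c, t))
    return simplified


# The example taxonomy: (b1 ^ b2) v b3 <= a1 ^ a2, and a1 ^ a2 <= c1.
taxonomy = [
    ((frozenset({"b1", "b2"}), frozenset({"b3"})), frozenset({"a1", "a2"})),
    ((frozenset({"a1", "a2"}),), frozenset({"c1"})),
]

simplified = simplify(taxonomy)
for tail, head in sorted(simplified, key=lambda p: (p[1], sorted(p[0]))):
    print(" ^ ".join(sorted(tail)), "<=", head)
```

Because the result is a set, a pair that could be produced twice is stored only once; on the example it contains the five relationships listed above.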

Finally, non-term queries can be replaced by term queries by inserting appropriate relationships into the
taxonomy. Specifically:

Proposition 2 For all sources $S = (T, \preceq, Obj, I)$ and non-term queries $q \in \mathcal{L}_T$, let $t_q$ be any term not in $T$
and moreover

$$T^q = T \cup \{t_q\}$$
$$\preceq^q = \preceq \cup \{(t_1 \land \ldots \land t_m, t_q) \mid t_1 \land \ldots \land t_m \text{ is a disjunct of } q\}$$
$$I^q = I \cup \{(t_q, \emptyset)\}.$$

Then, $ans(q, S) = ans(t_q, S^q)$ where $S^q = (T^q, \preceq^q, Obj, I^q)$. □

In practice, the terminology includes one additional term $t_q$, which has an empty interpretation and
subsumes each query disjunct $t_1 \land \ldots \land t_m$. The size of $S^q$ is clearly polynomial in the size of $S$ and $q$.
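
The construction of Proposition 2 can be sketched in a few lines. The encoding of the simplified taxonomy as a set of (tail, head) pairs, the function name `reduce_to_term_query` and the default fresh-term name `t_q` are assumptions made for the example.

```python
def reduce_to_term_query(terms, edges, I, q, fresh="t_q"):
    """Return (T^q, taxonomy^q, I^q, t_q) for a query q given as a list of conjunctions."""
    if fresh in terms:
        raise ValueError("the new term must not already belong to the terminology")
    new_terms = terms | {fresh}
    # One new relationship (disjunct, t_q) per disjunct of q.
    new_edges = edges | {(d, fresh) for d in q}
    new_I = dict(I)
    new_I[fresh] = set()           # the new term has an empty interpretation
    return new_terms, new_edges, new_I, fresh


# Hypothetical source with two terms and the query a2 ^ a3.
terms = {"a2", "a3"}
edges = set()
I = {"a2": {"o1", "o2"}, "a3": {"o2"}}
q = [frozenset({"a2", "a3"})]

terms_q, edges_q, I_q, t_q = reduce_to_term_query(terms, edges, I, q)
print(sorted(terms_q))   # ['a2', 'a3', 't_q']
print(edges_q == {(frozenset({"a2", "a3"}), "t_q")})  # True
```

The original source is left untouched, so the extended source can be discarded once the query has been answered.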

In light of the last Proposition, the problem of query evaluation amounts to determining $ans(t, S)$ for a given
term $t$ and source $S$, while the corresponding decision problem consists in checking whether $o \in ans(t, S)$,
for a given object $o$.

2.2 The decision problem 

Given a source $S = (T, \preceq, Obj, I)$, $o \in Obj$, and $t \in T$, the decision problem $o \in ans(t, S)$ has an equivalent
formulation in propositional datalog. We define the propositional datalog program $P_S$ as follows:

$$P_S = C_S \cup I_S \cup Q_S$$

where

$$C_S = \{t' \leftarrow t_1, \ldots, t_m \mid (t_1 \land \ldots \land t_m, t') \in \preceq\}$$
$$I_S = \{u \leftarrow \ \mid u \in ind_S(o)\}$$
$$Q_S = \{\leftarrow t\}$$

It is easy to see that:

Lemma 1 For all sources $S = (T, \preceq, Obj, I)$, $o \in Obj$ and $t \in T$, $o \in ans(t, S)$ iff $P_S$ is unsatisfiable. □

Based on Lemma 1, the decision problem $o \in ans(t, S)$ is connected to directed B-hypergraphs, which
are introduced next. We will mainly use definitions and results from [18].

A directed hypergraph is a pair $\mathcal{H} = (V, \mathcal{E})$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertexes and
$\mathcal{E} = \{E_1, E_2, \ldots, E_m\}$ is the set of directed hyperedges, where $E_i = (T(E_i), H(E_i))$ with $T(E_i), H(E_i) \subseteq V$ for
$1 \leq i \leq m$. $T(E_i)$ is said to be the tail of $E_i$, while $H(E_i)$ is said to be the head of $E_i$. A directed
B-hypergraph (or simply B-graph) is a directed hypergraph, where the head of each hyperedge $E_i$, denoted as
$h(E_i)$, is a single vertex.

A taxonomy can naturally be represented as a B-graph whose hyperedges represent one-to-one the
subsumption relationships of the transitive reduction of the taxonomy. In particular, the taxonomy B-graph of
a taxonomy $(T, \preceq)$ is the B-graph $\mathcal{H} = (T, \mathcal{E}_\preceq)$, where

$$\mathcal{E}_\preceq = \{(\{t_1, \ldots, t_m\}, u) \mid (t_1 \land \ldots \land t_m, u) \in \preceq\}.$$

Figure 1 (left) presents a taxonomy, whose B-graph is shown in the same Figure (right).



Figure 1: A taxonomy (left) and its B-graph (right)

A path $P_{st}$ of length $q$ in a B-graph $\mathcal{H} = (V, \mathcal{E})$ is a sequence of nodes and hyperedges

$$P_{st} = (s = v_1, E_{i_1}, v_2, E_{i_2}, \ldots, E_{i_q}, v_{q+1} = t),$$

where: $s \in T(E_{i_1})$, $h(E_{i_q}) = t$ and $h(E_{i_{j-1}}) = v_j \in T(E_{i_j})$ for $2 \leq j \leq q$. If $P_{st}$ exists, $t$ is said to be
connected to $s$. If $t \in T(E_{i_1})$, $P_{st}$ is said to be a cycle; if all hyperedges in $P_{st}$ are distinct, $P_{st}$ is said to be
simple. A simple path is elementary if all its vertexes are distinct.

A B-path $\pi_{st}$ in a B-graph $\mathcal{H} = (V, \mathcal{E})$ is a minimal (with respect to deletion of vertexes and hyperedges)
hypergraph $\mathcal{H}_\pi = (V_\pi, \mathcal{E}_\pi)$, such that:

1. $\mathcal{E}_\pi \subseteq \mathcal{E}$

2. $V_\pi \subseteq V$

3. $x \in V_\pi$ and $x \neq s$ imply that $x$ is connected to $s$ in $\mathcal{H}_\pi$ by means of a cycle-free simple path.

Vertex $y$ is said to be B-connected to vertex $x$ if a B-path $\pi_{xy}$ exists in $\mathcal{H}$.

B-graphs and satisfiability of propositional Horn clauses are closely related. The B-graph associated to
a set of Horn clauses has 3 types of directed hyperedges to represent each clause:

- the clause $p \leftarrow q_1 \land q_2 \land \ldots \land q_s$ is represented by the hyperedge $(\{q_1, q_2, \ldots, q_s\}, p)$;

- the clause $\leftarrow q_1 \land q_2 \land \ldots \land q_s$ is represented by the hyperedge $(\{q_1, q_2, \ldots, q_s\}, false)$;

- the clause $p \leftarrow$ is represented by the hyperedge $(\{true\}, p)$.

The following result is well-known:

Proposition 3 ([18]) A set of propositional Horn clauses is satisfiable if and only if in the associated
B-graph, false is not B-connected to true. □

We now proceed to show the role played by B-connection in query evaluation. For a source
$S = (T, \preceq, Obj, I)$ and an object $o \in Obj$, the object decision graph (simply the object graph) is the B-graph
$\mathcal{H}_o = (T, \mathcal{E}_o)$, where

$$\mathcal{E}_o = \mathcal{E}_\preceq \cup \{(\{true\}, u) \mid u \in ind_S(o)\}.$$

Figure 2 presents the object graph for the taxonomy shown in Figure 1 and an object $o$ such that
$ind_S(o) = \{c1, c2, c3\}$.

Figure 2: An object graph

We can now prove: 

Proposition 4 For all sources $S = (T, \preceq, Obj, I)$, terms $t \in T$, and objects $o \in Obj$, $o \in ans(t, S)$ iff $t$ is
B-connected to true in the object graph $\mathcal{H}_o$.

Proof: From Lemma 1, $o \in ans(t, S)$ iff $P_S$ is unsatisfiable iff (by Proposition 3) false is B-connected to true
in the associated B-graph. By construction, $\mathcal{H}_o$ is the B-graph associated to $P_S$, where $t$ plays the role of
false. □
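
The decision problem can be illustrated with a short sketch. It assumes that B-connection to true coincides with derivability by forward chaining over the hyperedges, which is the usual reading of the Horn-clause result above; the function name `in_answer` is hypothetical, and the hyperedges are the ones that can be read off the QE trace of Table 1 (hyperedges into other terms of Figure 1 are omitted).

```python
def in_answer(o, t, edges, I):
    """Decide whether o is in ans(t, S) by propagating from the index of o."""
    # Heads of the hyperedges ({true}, u): the terms in ind_S(o).
    reached = {u for u, objs in I.items() if o in objs}
    changed = True
    while changed:
        changed = False
        for tail, head in edges:
            # A hyperedge fires once every term in its tail has been reached.
            if head not in reached and tail <= reached:
                reached.add(head)
                changed = True
    return t in reached


EDGES = [
    (frozenset({"b3"}), "a2"),
    (frozenset({"b1", "b2"}), "a2"),
    (frozenset({"c1"}), "b1"),
    (frozenset({"c2"}), "b1"),
    (frozenset({"c2", "c3"}), "b2"),
    (frozenset({"b1", "b3"}), "c2"),
]

# An object o indexed under c1, c2 and c3, as in the object graph example.
I = {"c1": {"o"}, "c2": {"o"}, "c3": {"o"}}

print(in_answer("o", "a2", EDGES, I))  # True
print(in_answer("o", "b3", EDGES, I))  # False
```

Each pass over the hyperedges either adds a term or stops, so the loop runs at most once per term plus one final pass.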



2.3 An algorithm for query evaluation 

A typical approach for query evaluation is resolution, recently studied for peer-to-peer networks [?], [7], [5].
Here, we propose a simpler method to perform query evaluation, based on B-graphs. Our method relies on
the following result, which is just a re-phrasing of Proposition 4.

Corollary 1 For all sources $S = (T, \preceq, Obj, I)$, $o \in Obj$ and term queries $t \in T$, $o \in ans(t, S)$ if and only if
either $o \in I(t)$ or there exists a hyperedge $(\{u_1, \ldots, u_r\}, t) \in \mathcal{E}_\preceq$ such that $o \in \bigcap \{ans(u_i, S) \mid 1 \leq i \leq r\}$. □

This corollary simply "breaks down" Proposition 4 based on the distance between $t$ and true in the object
graph $\mathcal{H}_o$. If $o \in I(t)$, then $t \in ind_S(o)$, hence there is a hyperedge (in fact, a simple arc) from true to $t$ in
$\mathcal{H}_o$, which are 1 hyperedge distant from each other. If $o \notin I(t)$, then there are at least two hyperedges in
between true and $t$. Let us assume that $h$ is the one whose head is $t$. Since $t$ is B-connected to true, each term
$u_i$ in the tail of $h$ is B-connected to true. But this simply means, again by Proposition 4, that $o \in ans(u_i, S)$
for all the terms $u_i$, and so we have the forward direction of the Corollary. The backward direction of the
Corollary is straightforward. Notice that, by point 3 in the definition of B-path, $t$ is connected to each $u_i$ by
a cycle-free simple path; this fact is used by the procedure QE in order to correctly terminate in the presence of
loops in the taxonomy B-graph $\mathcal{H}$.

QE($x$ : term; $A$ : set of terms);

1. $R \leftarrow I(x)$

2. for each hyperedge $(\{u_1, \ldots, u_r\}, x)$ in $\mathcal{H}$ do

3. if $\{u_1, \ldots, u_r\} \cap A = \emptyset$ then $R \leftarrow R \cup (\mathrm{QE}(u_1, A \cup \{u_1\}) \cap \ldots \cap \mathrm{QE}(u_r, A \cup \{u_r\}))$

4. return($R$)

Figure 3: The procedure QE
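
The procedure of Figure 3 translates almost line by line into the following sketch. The hyperedges are those that can be read off the trace in Table 1; the interpretation `I` and the object names are invented for illustration, and tails are assumed to be non-empty.

```python
def qe(x, A, edges, I):
    """QE(x, A): objects in the answer of term x, given the set A of visited terms."""
    R = set(I.get(x, set()))                          # line 1
    for tail, head in edges:                          # line 2
        if head == x and not (tail & A):              # line 3: tail disjoint from A
            # Intersect the answers of all the terms in the tail.
            R |= set.intersection(*[qe(u, A | {u}, edges, I) for u in tail])
    return R                                          # line 4


EDGES = [
    (frozenset({"b3"}), "a2"),
    (frozenset({"b1", "b2"}), "a2"),
    (frozenset({"c1"}), "b1"),
    (frozenset({"c2"}), "b1"),
    (frozenset({"c2", "c3"}), "b2"),
    (frozenset({"b1", "b3"}), "c2"),
]

# Hypothetical interpretation: terms without an entry have an empty extension.
I = {"c1": {"o1"}, "c2": {"o2"}, "c3": {"o2", "o3"}, "b3": {"o4"}}

print(sorted(qe("a2", frozenset({"a2"}), EDGES, I)))  # ['o2', 'o4']
```

Here `o4` enters the answer through `b3`, while `o2` is found both under `b1` (via `c2`) and under `b2` (via `c2` and `c3`). The cycle between `b1` and `c2` is cut by the test on the visited set, so the recursion terminates.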



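Table 1: Evaluation of QE$(a2, \{a2\})$

| Call | Result |
|------|--------|
| QE$(a2, \{a2\})$ | $I(a2) \cup \mathrm{QE}(b3, \{a2, b3\}) \cup (\mathrm{QE}(b1, \{a2, b1\}) \cap \mathrm{QE}(b2, \{a2, b2\}))$ |
| QE$(b3, \{a2, b3\})$ | $I(b3)$ |
| QE$(b1, \{a2, b1\})$ | $I(b1) \cup \mathrm{QE}(c1, \{a2, b1, c1\}) \cup \mathrm{QE}(c2, \{a2, b1, c2\})$ |
| QE$(b2, \{a2, b2\})$ | $I(b2) \cup (\mathrm{QE}(c2, \{a2, b2, c2\}) \cap \mathrm{QE}(c3, \{a2, b2, c3\}))$ |
| QE$(c1, \{a2, b1, c1\})$ | $I(c1)$ |
| QE$(c2, \{a2, b1, c2\})$ | $I(c2)$ * |
| QE$(c2, \{a2, b2, c2\})$ | $I(c2) \cup (\mathrm{QE}(b1, \{a2, b2, c2, b1\}) \cap \mathrm{QE}(b3, \{a2, b2, c2, b3\}))$ |
| QE$(c3, \{a2, b2, c3\})$ | $I(c3)$ |
| QE$(b1, \{a2, b2, c2, b1\})$ | $I(b1) \cup \mathrm{QE}(c1, \{a2, b2, c2, b1, c1\})$ * |
| QE$(b3, \{a2, b2, c2, b3\})$ | $I(b3)$ |
| QE$(c1, \{a2, b2, c2, b1, c1\})$ | $I(c1)$ |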



The procedure QE, presented in Figure 3, computes $ans(t, S)$ for a given term $t$ (and an implicitly given
source $S$) by applying in a straightforward way Corollary 1. To this end, QE must be invoked as QE$(t, \{t\})$.
The second input parameter of QE is the set of terms on the path from $t$ to the currently considered term $x$.
This set is used to guarantee that $t$ is connected to all terms considered in the recursion by a cycle-free simple
path. QE accumulates in $R$ the result. The correctness of QE can be established by just observing that,
for all objects $o \in Obj$, $o$ is in the set $R$ returned by QE$(t, \{t\})$ if and only if $o$ satisfies the two conditions
expressed by Corollary 1.

As an example, let us consider the sequence of calls made by the procedure QE in evaluating the query $a2$
in the example source of Figure 1, as shown in Table 1. The calls marked with a * are those in which the test
in line 3 gives a negative result. Upon evaluating QE$(c2, \{a2, b1, c2\})$ the procedure realizes that the only
incoming hyperedge in $c2$ is $(\{b1, b3\}, c2)$, whose tail $\{b1, b3\}$ has a non-empty intersection with the current
path $\{a2, b1, c2\}$; so the hyperedge is ignored. In this case, the cycle $(b1, c2, b1)$ is detected and properly
handled. Analogously, upon evaluating QE$(b1, \{a2, b2, c2, b1\})$, the cycle $(c2, b1, c2)$ is detected and properly
handled. Also notice the difference between the calls QE$(c2, \{a2, b1, c2\})$ and QE$(c2, \{a2, b2, c2\})$. They both
concern $c2$, but in the former case, $c2$ is encountered upon descending along the path $(a2, b1, c2)$ whose next
hyperedge is $(\{b1, b3\}, c2)$; following that hyperedge would lead the computation back to the node $b1$, which
has already been met, thus the result of the call is just $I(c2)$. In the latter case, $c2$ is encountered upon
descending along the path $(a2, b2, c2)$, thus the hyperedge leading to $b1$ and $b3$ must be followed, since none
of the terms in its tail have been touched upon so far.

From a complexity point of view, QE visits all terms that lie on cycle-free simple paths ending at the
query term $t$ in the taxonomy B-graph $\mathcal{H}$. Now, it is not difficult to see that these paths may be exponentially
many in the size of the taxonomy. As an illustration, let us consider the taxonomy whose B-graph contains
the following hyperedges:

$$h_1 : (\{u_1, v_1\}, u_2) \quad h_2 : (\{u_2, v_2\}, u_3) \quad \ldots \quad h_{n-1} : (\{u_{n-1}, v_{n-1}\}, u_n) \quad h_n : (\{u_n, v_n\}, t)$$

$$g_1 : (\{u_1, v_1\}, v_2) \quad g_2 : (\{u_2, v_2\}, v_3) \quad \ldots \quad g_{n-1} : (\{u_{n-1}, v_{n-1}\}, v_n)$$

Let us assume $t$ is the query term. It is easy to verify that there are $2^{n-1}$ cycle-free simple paths connecting
$u_1$ to $t$, one for each sequence of the form

$$(u_1\ f_1\ x_2\ f_2\ \ldots\ x_{n-1}\ f_{n-1}\ x_n\ h_n\ t)$$

where $f_i$ can be either $h_i$ (in which case $x_{i+1}$ is $u_{i+1}$) or $g_i$ (in which case $x_{i+1}$ is $v_{i+1}$) for $1 \leq i \leq n - 1$.

On the other hand, for each query term, QE performs set-theoretic operations on sets of objects, which
initially are interpretations of terms. Thus, we conclude that QE has polynomial time complexity w.r.t. the
size of $Obj$.
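
The growth described above can be observed with a small experiment that builds the ladder-shaped taxonomy and counts the recursive calls QE would make on it. This is only a sketch: the helper names are invented, interpretations are ignored because only the number of calls matters, and the tail of the last hyperedge is assumed to be $\{u_n, v_n\}$.

```python
def ladder(n):
    """Hyperedges h_1..h_n and g_1..g_{n-1} of the ladder-shaped taxonomy."""
    edges = []
    for i in range(1, n):
        tail = frozenset({f"u{i}", f"v{i}"})
        edges.append((tail, f"u{i + 1}"))                  # h_i
        edges.append((tail, f"v{i + 1}"))                  # g_i
    edges.append((frozenset({f"u{n}", f"v{n}"}), "t"))     # h_n (assumed tail)
    return edges


def count_calls(x, A, edges):
    """Number of QE invocations triggered by QE(x, A), including the call itself."""
    calls = 1
    for tail, head in edges:
        if head == x and not (tail & A):
            for u in tail:
                calls += count_calls(u, A | {u}, edges)
    return calls


for n in (1, 2, 4, 8):
    print(n, count_calls("t", frozenset({"t"}), ladder(n)))
# 1 3
# 2 7
# 4 31
# 8 511
```

The taxonomy has only $2n - 1$ hyperedges, yet the number of calls doubles with every additional level, which is what motivates the use of a cache in the implementations discussed later.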

2.4 Networks of Information Sources 

In this Section we complete the definition of our model by introducing networks of information sources. In 
order to be a component of a networked information system, a source is endowed with additional subsumption 
relations, called articulations, which relate the source terminology to the terminologies of other sources of 
the same kind. 

Definition 8 (Articulation) Given two terminologies $T$ and $U$, an articulation from $T$ to $U$ is a pair $(q, t)$
where $q \in \mathcal{L}_U$ is a conjunctive query and $t \in T$. □

An articulation is not syntactically different from a subsumption relationship, except that its head may
be a term of a different terminology than the one where the terms making up its tail come from.

Definition 9 (Articulated source) An articulated source $S$ over $k \geq 1$ disjoint terminologies $T_1, \ldots, T_k$,
is a 5-tuple $S = (T_S, \preceq_S, Obj, I_S, R_S)$, where:

- $(T_S, \preceq_S, Obj, I_S)$ is a source;

- $R_S$ is a set of articulations $(q, t)$ where $t \in T_S$, $q$ is a conjunctive query in $\mathcal{L}_T$ and $T = \bigcup_{i=1}^{k} T_i$. □

Articulations are used to connect an articulated source to other articulated sources, so creating a
networked information system. An articulated source $S$ with an empty interpretation, i.e. $I_S(t) = \emptyset$ for all
$t \in T_S$, is called a mediator in the literature.

Definition 10 (Network) A network of articulated sources, or simply a network, $\mathcal{N}$ is a non-empty set
of articulated sources $\mathcal{N} = \{S_1, \ldots, S_n\}$, where each $S_i$ is articulated over the terminologies of some of the
other sources in $\mathcal{N}$ and all terminologies $T_{S_1}, \ldots, T_{S_n}$ of the sources in $\mathcal{N}$ are disjoint. □

Notice that the domain of the interpretation of an articulated source is independent from the source,
thus the same for any articulated source. This is not necessary for our model to work; it just reflects a typical
situation of networked resources such as URLs. Relaxing this constraint would have no impact on the results
reported in the present study.

An intuitive way of interpreting a network is to view it as a single source which is distributed along
the nodes of a network, each node dealing with a specific vocabulary. The global source can be logically
constructed by removing the barriers which separate local sources, as if (virtually) collecting all the network
information in a single repository. The notion of network source captures this interpretation of a network.

Definition 11 (Network source, network query) The network source $S_\mathcal{N}$ of a network of articulated
sources $\mathcal{N} = \{S_1, \ldots, S_n\}$, is the source $S_\mathcal{N} = (T_\mathcal{N}, \preceq_\mathcal{N}, Obj, I_\mathcal{N})$, where:

$$T_\mathcal{N} = \bigcup_{i=1}^{n} T_{S_i}, \qquad \preceq_\mathcal{N} = \bigcup_{i=1}^{n} (\preceq_{S_i} \cup R_{S_i}), \qquad I_\mathcal{N} = \bigcup_{i=1}^{n} I_{S_i}.$$

A network query is a query over $T_\mathcal{N}$. □

The source $S_\mathcal{N}$ emerges in a bottom-up manner from the articulations of the sources, as postulated in [?].
Note that Definition 9 does not necessarily imply that $\preceq_S$, $R_S$, and $I_S$ are stored in the articulated source
$S$. In fact, given a network of articulated sources $\mathcal{N}$, in Section 4 several architectures will be considered
for storing $\preceq_\mathcal{N}$ and $I_\mathcal{N}$. A network query is a query in any one of the languages of the sources making up the
network. As will be evident, our query evaluation method only requires minor modifications to be able to
evaluate also queries in the language $\mathcal{L}_{T_\mathcal{N}}$, that is queries that mix terms from different terminologies.

The answer to a network query $q$, or network answer, is given by $ans(q, S_\mathcal{N})$.
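
A rough sketch of how a network source can be assembled is given below. It assumes that the network taxonomy is obtained by pooling the local taxonomies and the articulations of all sources, and that terminologies are disjoint; the dictionary layout, the function name `network_source` and the split of terms among the three sources are illustrative assumptions, not taken from Figure 4.

```python
def network_source(sources):
    """Pool terms, subsumption relationships and interpretations of all sources."""
    terms, edges, I = set(), set(), {}
    for s in sources:
        terms |= s["terms"]
        edges |= s["taxonomy"] | s["articulations"]
        I.update(s["interpretation"])   # safe because terminologies are disjoint
    return terms, edges, I


# Hypothetical split of part of the running example among three sources.
P_a = {
    "terms": {"a2"},
    "taxonomy": set(),
    "articulations": {(frozenset({"b3"}), "a2"), (frozenset({"b1", "b2"}), "a2")},
    "interpretation": {"a2": set()},
}
P_b = {
    "terms": {"b1", "b2", "b3"},
    "taxonomy": set(),
    "articulations": {(frozenset({"c1"}), "b1"), (frozenset({"c2", "c3"}), "b2")},
    "interpretation": {"b1": set(), "b2": set(), "b3": {"o4"}},
}
P_c = {
    "terms": {"c1", "c2", "c3"},
    "taxonomy": set(),
    "articulations": {(frozenset({"b1", "b3"}), "c2")},
    "interpretation": {"c1": {"o1"}, "c2": {"o2"}, "c3": {"o2", "o3"}},
}

terms, edges, I = network_source([P_a, P_b, P_c])
print(len(terms), len(edges))  # 7 5
```

Once the pooled source is available, a network query can in principle be answered with the same procedure used for a single source; the architectures discussed later differ in where the pooled taxonomy and interpretation are actually stored.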

Figure 4 presents the taxonomy of a network source $S_\mathcal{N}$, where $\mathcal{N}$ consists of 3 sources $\mathcal{N} = \{P_a, P_b, P_c\}$.
As can be verified, this is the same taxonomy as the one shown in Figure 1, except that now some of its
subsumption relationships are elements of articulations.




Figure 4: A network taxonomy 



3 Query Evaluation Principles 

Before delving into distributed and optimized architectures, we now put the query evaluation problem in a 
software perspective, illustrating the basic principles that will be followed in subsequent developments. 
We begin by distinguishing between two basic approaches for carrying out query evaluation: 

- The Direct approach, in which the answer is computed in one stage. 

- The Rewriting (or two-stage) approach, in which query evaluation is performed in two stages: a re-write
of the query is computed in the first stage and evaluated in the second stage.

Each one of these approaches will be illustrated in the rest of this Section in a general way. The
implementation details will be specified in the next Section. In both cases, the processes performing the query
evaluation task will communicate in an asynchronous way, by exchanging messages through the appropriate
queues. In this way, no process is blocked waiting for some other process to finish and the number of servers
can be expanded at will. The former fact favours efficiency, while the latter favours scalability.

3.1 Direct query evaluation 

The main processes involved in direct query evaluation are: 

- Query, 

- Ask, and 

- Tell. 

3.1.1 Query 

Query has two main tasks: to handle the communication with applications and to initiate query evaluation.
Following the syntax for queries given in Definition 1, Query receives in input queries $q$ of the form:

$$q = C_1 \lor \ldots \lor C_n$$

where each $C_i$ is a conjunctive query. As a first step, Query reduces $q$ to a term query $t$ by (a) generating a
new term $t$ not in $T$, and (b) inserting a new hyperedge $(C_i, t)$ into the taxonomy B-graph for each conjunctive
query $C_i$ (see Proposition 2). A new query id ID for $t$ is subsequently obtained, and an ask message is sent
(see below) for evaluating $t$, including ID, $t$, and the set of already visited terms, that is just $t$. Finally, ID
is returned, to allow the requesting application to retrieve the query result as soon as it is available.

As an example, let us consider the network shown in Figure 4, whose B-graph is given in Figure 1,
and the query $a2 \land a3$. When given as input to Query, a new term $t$ is generated and the hyperedge
$(\{a2, a3\}, t)$ is added to the taxonomy B-graph. The id of the new query is also generated, let it be $q1$. Then
Query sends the message ask$(q1, t, \{t\})$ and returns $q1$.

3.1.2 Ask 

An ask message represents the request of evaluating a term query and consists of 3 fields: 

- the id of the term query; 

- the term constituting the query; 

- the set of already visited terms (as requested by Qe). 

The basic task of Ask is to analyze an ask message in order to ascertain whether there are hyperedges to
consider for evaluating the given term query, i.e. any hyperedge that passes the test on line 3 of QE. If
so, Ask breaks down the query into sub-queries as established by QE, and launches the evaluation of these
sub-queries by issuing the corresponding ask messages. If not, it just returns the interpretation of the given
term.

In our running example, upon processing the message $(q1, t, \{t\})$, Ask finds that the hyperedge
$(\{a2, a3\}, t)$ needs to be considered. This hyperedge requires breaking the term query $q1$ into two
sub-queries, one for the term $a2$ and one for the term $a3$. The intersection of the results of these sub-queries
will have to be computed in order to have the final result. Now each sub-query needs a unique identifier; let
us assume that $q2$ is the identifier of the sub-query relative to term $a2$, and $q3$ is that relative to $a3$. Then
Ask issues the following ask messages:

- $(q2, a2, \{t, a2\})$, and

- $(q3, a3, \{t, a3\})$.

Notice that in each message the set of visited terms is expanded as established by QE.

In order to keep track of the evaluation of a query ID, a query program is associated to ID, given by a 
set of sub-programs {SPi, . . . , SPk} where each sub-program SPj is relative to a hyperedge to be considered 
in the evaluation of ID, and is given by a set of calls. A call represents a sub-query of ID, and can be: 

- open, meaning that the sub-query is being evaluated, in which case the call is the sub-query id; 

- closed, meaning the sub-query has been evaluated, in which case the call is the resulting set of objects. 

A query program is closed if all calls in it are closed. In the above example, the program associated to $q1$
consists of just one sub-program (since there is only one relevant hyperedge) given by $\{q2, q3\}$.

Upon processing the message $(q3, a3, \{t, a3\})$, Ask finds that there are no hyperedges incoming into
term $a3$. Thus the query can be evaluated immediately, which Ask does by issuing the tell message
$(q3, I(a3))$, which just tells that the result of $q3$ is $I(a3)$.



3.1.3 Tell 

When a tell message {QID, R) is processed, QID is a sub-query of some other query QIDi, that is an open 
call in the query program associated to QIDi. The basic task of Tell is to ascertain whether the result 
of QID is the last one needed for computing the result of QIDi. If not. Tell just records the result of 
QID by replacing QID by R in the query program of QIDi. In our example, upon processing the message 
(gS, /(a3)). Tell updates qVs program which becomes {q2,/(a3)}. 

If QID is the last open call, then the query program of QIDi is closed, in which case Tell computes 
the result of QIDi and communicates it by issuing a corresponding tell message. The result of a closed 
program {5Pi, . . . , SPm}, where each sub-program SPi is a collection of object sets SPi = {R\, . . . , i?™.}, 
is the set of objects given by: 

m mi 

U n Ry (1) 

i=l j=l 

Notice that the processing of a tell message may cause the issue of another tell message, and so on, until 
eventually all the sub-query's programs of a query are closed and the final answer is obtained. 
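
The bookkeeping performed by Tell can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation of a query program as a list of lists, and the names `close_call`, `is_closed` and `compute_answer`, are hypothetical and do not come from the paper. Open calls are sub-query ids (strings), closed calls are sets of objects.

```python
from typing import FrozenSet, List, Union

# A call is either an open sub-query id (str) or a closed result (a set of objects).
Call = Union[str, FrozenSet[str]]
# A query program is a list of sub-programs, one per hyperedge (assumed representation).
Program = List[List[Call]]


def close_call(program: Program, qid: str, result: FrozenSet[str]) -> Program:
    """Return a copy of the program in which the open call qid is replaced by its result."""
    return [[result if call == qid else call for call in sp] for sp in program]


def is_closed(program: Program) -> bool:
    """A program is closed when every call in it is a result rather than a sub-query id."""
    return all(isinstance(call, frozenset) for sp in program for call in sp)


def compute_answer(program: Program) -> FrozenSet[str]:
    """Expression (1): the union, over sub-programs, of the intersection of their results."""
    if not is_closed(program):
        raise ValueError("the query program still contains open calls")
    answer: FrozenSet[str] = frozenset()
    for sp in program:
        # Each sub-program comes from a hyperedge, so it is assumed to be non-empty.
        answer |= frozenset.intersection(*sp)
    return answer
```

For a program of the shape {{q2}, {q3, q4}}, closing the three calls one after the other and then calling `compute_answer` yields the result of q2 united with the intersection of the results of q3 and q4.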

The complete series of ask and tell messages produced during the evaluation of the query a2 in the 
example source of Figure 1 is given in Table 2, whose columns show: the incoming message, the generated 
messages (when no message is generated, the changes to the relevant query program are reported), and the 
query program generated in ask messages. Queries are identified by non-negative integers, while R(n) stands 
for the result of query n. This table should be compared with Table 1, which shows the sequence of Qe calls for 
the same query evaluation. 
Table 2: Messages generated in the direct evaluation of the query a2 

| Incoming message | Generated messages | Query program |
|---|---|---|
| ask(4, b2, {a2, b2}) | ask(7, c2, {a2, b2, c2}), ask(8, c3, {a2, b2, c3}) | {{7, 8}} |
| ask(5, c1, {a2, b1, c1}) | tell(5, I(c1)) | |
| ask(6, c2, {a2, b1, c2}) | tell(6, I(c2)) | |
| ask(7, c2, {a2, b2, c2}) | ask(9, b1, {a2, b2, c2, b1}), ask(10, b3, {a2, b2, c2, b3}) | {{9, 10}} |
| ask(8, c3, {a2, b2, c3}) | tell(8, I(c3)) | |
| ask(9, b1, {a2, b2, c2, b1}) | ask(11, c1, {a2, b2, c2, b1, c1}) | {{11}} |
| ask(10, b3, {a2, b2, c2, b3}) | tell(10, I(b3)) | |
| ask(11, c1, {a2, b2, c2, b1, c1}) | tell(11, I(c1)) | |
| tell(11, I(c1)) | tell(9, I(b1) ∪ R(11)) | |
| tell(10, I(b3)) | the query program of 7 becomes {{9, R(10)}} | |
| tell(9, I(b1) ∪ R(11)) | tell(7, I(c2) ∪ (R(9) ∩ R(10))) | |
| tell(8, I(c3)) | the query program of 4 becomes {{7, R(8)}} | |
| tell(7, I(c2) ∪ (R(9) ∩ R(10))) | tell(4, I(b2) ∪ (R(7) ∩ R(8))) | |
| tell(6, I(c2)) | the query program of 3 becomes {{5}, {R(6)}} | |
| tell(5, I(c1)) | tell(3, I(b1) ∪ R(5) ∪ R(6)) | |
| tell(4, I(b2) ∪ (R(7) ∩ R(8))) | the query program of 1 becomes {{2}, {3, R(4)}} | |
| tell(3, I(b1) ∪ R(5) ∪ R(6)) | the query program of 1 becomes {{2}, {R(3), R(4)}} | |
| tell(2, I(b3)) | tell(1, I(a2) ∪ R(2) ∪ (R(3) ∩ R(4))) | |

3.2 Correctness and complexity 

The combined action of Ask and Tell is equivalent to the behavior of the procedure Qe. To see why, it 
suffices to consider the following facts: 

1. An ask message is generated for each recursive call performed by Qe and vice versa, that is, whenever 
Qe would perform a recursive call, an ask message is generated. Therefore, the number of ask messages 
is the same as the number of terms that can be found on all B-paths from t. 

2. For each ask message, exactly one tell message results. This can be observed by considering that, 
for each processed ask message, there can be two cases: 

(a) there is no hyperedge to consider: in this case, no subsequent ask message is generated, and a 
tell message is generated; 

(b) there is at least one hyperedge to consider: in this case a number of sub-queries is generated and 
evaluated by issuing the corresponding ask messages; each such message has a larger set of visited 
terms. Since the B-graph is finite, eventually each sub-query will lead to a term falling in the 
previous case (this is how Qe terminates). When the programs of all sub-queries of a given term 
query t are closed, Tell issues a tell message on t. This propagates the closure upwards, until 
all open calls are closed. 

3. Finally, a closed query program is interpreted by computing (see expression (1)) the same operation 
on the results of sub-queries as Qe does on the results of its recursive calls. 

As a consequence of these facts, we have the correctness of the above described network query evaluation 
procedure, and also its polynomial time complexity with respect to the size of Obj . Note that the total number 
of messages generated is twice the number of terms visited by Qe, and the number of query programs is no 
larger than that. 
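
The equivalence argued above can be checked on a small example. The following Python sketch compares a recursive evaluation in the style of Qe with a message-driven one, and counts ask and tell messages. It rests on assumptions: the B-graph `HYPEREDGES` is inferred from the message trace of Table 2 rather than taken from the paper's figure, the interpretation `INTERPRETATION` is invented for illustration, and all function names are hypothetical.

```python
from collections import deque

# Assumed B-graph: head term -> list of hyperedge tails (inferred from the Table 2 trace).
HYPEREDGES = {
    "a2": [("b3",), ("b1", "b2")],
    "b1": [("c1",), ("c2",)],
    "b2": [("c2", "c3")],
    "c2": [("b1", "b3")],
}
# Hypothetical interpretation: term -> set of object ids.
INTERPRETATION = {"a2": {1}, "b1": {2}, "b2": {3}, "b3": {4},
                  "c1": {3, 5}, "c2": {3, 6}, "c3": {3, 7}}


def qe(t, visited):
    """Recursive evaluation: I(t) united with, for every hyperedge whose tail
    avoids the visited terms, the intersection of the results of its tail terms."""
    result = set(INTERPRETATION[t])
    for tail in HYPEREDGES.get(t, []):
        if not set(tail) & visited:
            result |= set.intersection(*(qe(u, visited | {u}) for u in tail))
    return result


def ask_tell(t):
    """Message-driven evaluation; returns (answer, number of asks, number of tells)."""
    queue = deque([("ask", 1, t, frozenset({t}))])
    programs, parent, term = {}, {}, {}
    next_id, asks, tells = 2, 0, 0
    while queue:
        msg = queue.popleft()
        if msg[0] == "ask":
            _, qid, u, visited = msg
            asks += 1
            term[qid] = u
            program = []
            for tail in HYPEREDGES.get(u, []):
                if set(tail) & visited:
                    continue  # the hyperedge touches an already visited term
                sub = []
                for v in tail:
                    sub.append(next_id)
                    parent[next_id] = qid
                    queue.append(("ask", next_id, v, visited | {v}))
                    next_id += 1
                program.append(sub)
            if program:
                programs[qid] = program
            else:  # no hyperedge to consider: answer immediately
                queue.append(("tell", qid, set(INTERPRETATION[u])))
        else:
            _, qid, result = msg
            tells += 1
            if qid not in parent:  # tell for the original query: done
                return result, asks, tells
            pid = parent[qid]
            programs[pid] = [[result if c == qid else c for c in sp]
                             for sp in programs[pid]]
            if all(isinstance(c, set) for sp in programs[pid] for c in sp):
                total = set(INTERPRETATION[term[pid]])
                for sp in programs[pid]:
                    total |= set.intersection(*sp)
                queue.append(("tell", pid, total))
```

On this assumed graph, `ask_tell("a2")` assigns the same query ids as Table 2 (sub-queries 2 to 11) and produces one tell per ask, so the total number of messages is twice the number of visited terms.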

3.3 Re-writing based query evaluation 

A query re-write represents in a symbolic way the computation of a query result according to the procedure 
Qe. Specifically, a query re-write is a syntax tree with 3 types of nodes: union nodes, intersection nodes 
and terms. The first two types of nodes are non-terminal, whereas all terminal nodes are terms. To evaluate 
a query re-write means to replace the terms by their interpretation and then to execute the unions and the 
intersections as they appear in the syntax tree, finally obtaining a set of objects. The re-write of the query 
whose Qe calls are presented in Table 1 is given in Figure 5. 

Query evaluation based on re-writing is a slight variation of the method presented in Section 3, in which 
ask messages and query programs are exactly the same, while tell messages return linearizations of sub-trees 
of the re-write; the last tell message returns the whole re-write in a linear form. 

To exemplify, Table 3 shows the tell messages produced in the re-writing of the query a2, whose direct 
evaluation is reported in Table 2. 
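
As an illustration of what evaluating a re-write means, the following Python sketch represents a re-write as a nested tuple and evaluates it bottom-up. The tuple encoding, the function names and the interpretation used in the usage note are assumptions made for this example; the tree `REWRITE_A2` is obtained by expanding the R(n) placeholders in the tell messages of the re-writing of a2.

```python
from typing import Dict, Set, Tuple, Union

# A re-write is a term (str) or a tuple (operator, child, child, ...),
# where operator is "union" or "inter" (assumed encoding).
Rewrite = Union[str, Tuple]


def evaluate(rw: Rewrite, interpretation: Dict[str, Set[int]]) -> Set[int]:
    """Replace terms by their interpretation, then execute unions and intersections."""
    if isinstance(rw, str):
        return set(interpretation[rw])
    op, *children = rw
    results = [evaluate(child, interpretation) for child in children]
    if op == "union":
        return set().union(*results)
    if op == "inter":
        return set.intersection(*results)
    raise ValueError(f"unknown node type: {op}")


def linearize(rw: Rewrite) -> str:
    """Render a re-write in a linear form, as carried by tell messages."""
    if isinstance(rw, str):
        return rw
    op, *children = rw
    symbol = " ∪ " if op == "union" else " ∩ "
    return "(" + symbol.join(linearize(c) for c in children) + ")"


# Re-write of a2, built from the sub-trees returned for queries 9, 7, 4 and 3.
R9 = ("union", "b1", "c1")
R7 = ("union", "c2", ("inter", R9, "b3"))
R4 = ("union", "b2", ("inter", R7, "c3"))
R3 = ("union", "b1", "c1", "c2")
REWRITE_A2 = ("union", "a2", "b3", ("inter", R3, R4))
```

With a hypothetical interpretation such as `{"a2": {1}, "b1": {2}, "b2": {3}, "b3": {4}, "c1": {3, 5}, "c2": {3, 6}, "c3": {3, 7}}`, `evaluate(REWRITE_A2, ...)` returns a set of objects without any further message exchange, which is the point of separating re-writing from evaluation.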
Figure 5: A re-write of the query shown in Table 1 

Table 3: Messages generated in the re-writing of the query a2 

| Incoming message | Generated messages |
|---|---|
| ask(2, b3, {a2, b3}) | tell(2, "b3") |
| ask(5, c1, {a2, b1, c1}) | tell(5, "c1") |
| ask(6, c2, {a2, b1, c2}) | tell(6, "c2") |
| ask(8, c3, {a2, b2, c3}) | tell(8, "c3") |
| ask(10, b3, {a2, b2, c2, b3}) | tell(10, "b3") |
| ask(11, c1, {a2, b2, c2, b1, c1}) | tell(11, "c1") |
| tell(11, "c1") | tell(9, "b1 ∪ R(11)") |
| tell(10, "b3") | the query program of 7 becomes {{9, R(10)}} |
| tell(9, "b1 ∪ R(11)") | tell(7, "c2 ∪ (R(9) ∩ R(10))") |
| tell(8, "c3") | the query program of 4 becomes {{7, R(8)}} |
| tell(7, "c2 ∪ (R(9) ∩ R(10))") | tell(4, "b2 ∪ (R(7) ∩ R(8))") |
| tell(6, "c2") | the query program of 3 becomes {{5}, {R(6)}} |
| tell(5, "c1") | tell(3, "b1 ∪ R(5) ∪ R(6)") |
| tell(4, "b2 ∪ (R(7) ∩ R(8))") | the query program of 1 becomes {{2}, {3, R(4)}} |
| tell(3, "b1 ∪ R(5) ∪ R(6)") | the query program of 1 becomes {{2}, {R(3), R(4)}} |
| tell(2, "b3") | tell(1, "a2 ∪ R(2) ∪ (R(3) ∩ R(4))") |

4 Algorithms and Architectures for Network Query Evaluation 

We now consider distributed architectures for network query evaluation, based on the approaches outlined 
in the previous Section. In order to identify all significant architectures, it is important to consider how 
taxonomies and interpretations are allocated on the network. In this respect, there are four possibilities: 

- Both the network taxonomy and interpretation are centralized, that is allocated on one source, which 
is termed global server. 

- The network taxonomy is centralized in the taxonomy server, while each source holds its own 
interpretation. 

- Each source holds its taxonomy (including articulations) , while the network interpretation is allocated 
to a single source, the interpretation server. 

- Both the network taxonomy and interpretation are distributed to the sources, in a pure peer-to-peer 
model. 

Considered in conjunction with the two evaluation approaches identified in the previous Section, i.e. direct 
and re-writing, these possibilities give rise to 8 different architectures. In order to indicate any one of 
them, we will use 3-letter names, as follows: the first and second letters denote, respectively the allocation 
of taxonomy and interpretation (C standing for centralized and D for distributed), while the third letter 
indicates the type of evaluation (D for direct, R for rewriting). Thus, CDR denotes the approach in which 
the taxonomy is centralized, the interpretations are distributed and the query is first re-written and then 
evaluated. The rest of this study is devoted to ranking these methods with respect to their performance in terms 
of response time. In this respect, some approaches stand out immediately as not particularly promising. 
Namely, 

- When there is a global server source, query re-writing (CCR) is clearly a loser with respect to the 
direct approach (CCD): if everything is in one place, there is no gain to be made in following a two-stage 
approach; as a consequence, the approach CCR will no longer be considered. 

- For the opposite reason, CDD is a clear loser with respect to CDR: if the taxonomy is centralized, 
in CDR the taxonomy server is contacted only once to re-write the query, while in CDD it is invoked at 
every sub-query evaluation. 

- For the same reason, DCD is also a clear loser with respect to DCR: if the interpretation is centralized, 
it is more convenient to re-write the query and consult the interpretation server only once for the final 
evaluation, rather than invoking it repeatedly. 

Before delving into the analysis of the remaining 5 methods, we present the model of a source, which is 
common to all methods. 
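
The naming scheme and the pruning argument above can be made concrete with a short Python sketch. The dictionaries and function names are illustrative only; the three discarded architectures and the approaches that dominate them are those named in the text.

```python
from itertools import product

ALLOCATION = {"C": "centralized", "D": "distributed"}
EVALUATION = {"D": "direct", "R": "re-writing"}

# Discarded architecture -> the architecture that dominates it, as argued in the text.
DOMINATED = {"CCR": "CCD", "CDD": "CDR", "DCD": "DCR"}


def all_architectures():
    """Enumerate the 3-letter names: taxonomy allocation, interpretation allocation, evaluation."""
    return ["".join(code) for code in product(ALLOCATION, ALLOCATION, EVALUATION)]


def remaining_architectures():
    """Architectures left after removing the dominated ones."""
    return [name for name in all_architectures() if name not in DOMINATED]


def describe(name):
    """Spell out a 3-letter architecture name."""
    taxonomy, interpretation, evaluation = name
    return (f"taxonomy {ALLOCATION[taxonomy]}, "
            f"interpretation {ALLOCATION[interpretation]}, "
            f"{EVALUATION[evaluation]} evaluation")
```

Calling `remaining_architectures()` leaves CCD, CDR, DCR, DDD and DDR, which are the 5 methods analyzed in the rest of the section.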

4.1 Model of a source 

A source (Figure 6) consists of three main architectural elements: 

- Applications, which formulate queries and wait to receive the corresponding answers. 

- Source Component, which is a set of processes, exposed via an API, implementing query evaluation 
according to the principles outlined in the previous Section. As we will see, the methods exposed by 
a Source Component, as well as their semantics, may vary depending on the approach. 

- Communication Components, consisting of the data structures and the processes which manage the 
interaction between the Source Component on the one hand, and the network and the Applications 
on the other. Inter-process communication is implemented by means of queues, as anticipated. The 
following queues are part of the architecture of every type of source: Input Request Queue (IRQ), 
Output Request Queue (ORQ) and Answer Queue (AQ). Query evaluation requests (whether from 
local applications or from other sources) are handled by the Query Receiver, which places them on 
the IRQ, from where the IRQ Server dequeues them for processing by the Source Component. Once 
a query is evaluated, the answer is placed on the AQ or on the ORQ, depending on whether the request 
comes from a local application or another source, respectively. Messages posted on the ORQ of a 
source are directed to the IRQ of the receiving source. Due to the optimization techniques used for 
representing object identifiers and to the assumptions in query evaluation, messages are one-to-one 
with network packets, thus messages are the units of communication. 
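
The flow of messages through the three queues can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: the class `Source`, its method names and the `deliver` helper standing in for the network are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Source:
    """A source reduced to its three queues (hypothetical model)."""
    name: str
    irq: deque = field(default_factory=deque)  # Input Request Queue
    orq: deque = field(default_factory=deque)  # Output Request Queue
    aq: deque = field(default_factory=deque)   # Answer Queue

    def receive(self, message):
        """Query Receiver: place an incoming request on the IRQ."""
        self.irq.append(message)

    def post_answer(self, requester, query_id, answer):
        """Place an answer on the AQ for a local requester, on the ORQ otherwise."""
        if requester == self.name:
            self.aq.append((query_id, answer))
        else:
            self.orq.append((requester, query_id, answer))


def deliver(sender, network):
    """Stand-in for the network: move each ORQ message to the receiver's IRQ."""
    while sender.orq:
        receiver, query_id, payload = sender.orq.popleft()
        network[receiver].receive((query_id, payload))
```

In this model, an answer for a local application never leaves the source, while an answer for another source travels ORQ, network, IRQ, as described above.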



Figure 6: Architecture of a Source (components shown: Application, Query Receiver, IRQ, IRQ Server, AQ, ORQ, Source Component, Network) 

Every source uses a data structure, called Query Cache (QC for short). QC stores partial results and 
additionally works as a cache, storing final results for re-use. Two different time-outs are therefore used: the 
answer time-out (ta), that is the amount of time an answer is waited for; and the cache time-out (tc), 
that is the amount of time an answer is cached for re-use. At any time, an object in the QC is associated 
with only one time-out, depending on its state. 

A QC object corresponds to a query or a sub-query and has the following attributes: 

Query-ID the identifier of the object; we do not make any assumption on the structure of the identifier, 
except that the source where it has been created must be recoverable from it (for instance as an IP 
number); 

exp the query expression; this may be the original query or a single term; 
state the state of the object (see below); 
dependency (dep, for short) the Query-ID of the oldest query q in the QC having the same expression 
as the present query, if such a query exists; null otherwise; this attribute is kept for re-using the result 
of q also for the present query, thus optimizing performance; 

answer the answer of the query, if it has been computed; null otherwise; 

time-out the expiration time of the object; after that point the object will eventually be deleted; 
QP the query program associated to the object; 
n the number of open calls in QP; 

rewriting a Boolean value set to true if the first stage of the re-writing approach has been completed, to 
false otherwise. 

During its life inside the QC of a source, an object may be in one of the following states (in what follows 
we will use "query" to mean the (sub)query associated to the QC object): 

free the query is being evaluated, no answer has arrived for it so far, and no other query depends on it; 

principal the query is being evaluated, no answer has arrived for it so far, and there exists at least one 
other query that depends on it; 

dependent the query has been received but no evaluation for it has been launched, since there exists a free, 
principal or declined query with the same expression, which is being evaluated; 

declined the query has expired before an answer for it was received, but it has not been deleted because 
other queries are dependent on it; 

closed the answer for the query has been computed, and the query is kept in the Cache in order to re-use 
the answer, until it expires; 

total the query is a term of an original query, it has to be evaluated, and the answer can be used to answer 
queries given by the same term. 

partial the query is a term that has been encountered during the evaluation of a total term, thus its answer 
cannot be re-used to answer queries with the same expression. 

In the re-writing approaches, two more states are defined: 

free-rw the query has been re-written and is being evaluated, no answer has arrived for it so far, and no 
other query depends on it; 

principal-rw the query has been re-written and is being evaluated, no answer has arrived for it so far, and 
there exists at least one other query that depends on it. 
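
The attributes and states listed above map naturally onto a record type. The following Python sketch is one possible encoding, given only as an illustration: the field types, the default values and the `expired` helper are assumptions, while the attribute and state names follow the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class State(Enum):
    FREE = "free"
    PRINCIPAL = "principal"
    DEPENDENT = "dependent"
    DECLINED = "declined"
    CLOSED = "closed"
    TOTAL = "total"
    PARTIAL = "partial"
    FREE_RW = "free-rw"            # re-writing approaches only
    PRINCIPAL_RW = "principal-rw"  # re-writing approaches only


@dataclass
class QCObject:
    """One entry of the Query Cache (field types are assumed)."""
    query_id: str                       # must allow recovering the creating source
    exp: str                            # the original query or a single term
    state: State
    time_out: float                     # expiration time of the object
    dep: Optional[str] = None           # Query-ID of the query this one depends on
    answer: Optional[Set[str]] = None   # None until the answer has been computed
    qp: List[list] = field(default_factory=list)  # query program
    n: int = 0                          # number of open calls in qp
    rewriting: bool = False             # True once the first re-writing stage is done

    def expired(self, now: float) -> bool:
        """True when the single time-out currently attached to the object has elapsed."""
        return now >= self.time_out
```

A single `time_out` field suffices because, as stated above, an object is associated with only one time-out at any time: the answer time-out while it is being evaluated, the cache time-out once it is closed.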

4.2 Query evaluation in the CCD Approach 

In the CCD approach, the network taxonomy and interpretation are centralized (in Server Sources, see 
below), while the answer is computed in one stage. This approach provides us with the basic concepts 
for describing the rest of the architectures. Additionally, it provides us with the lowest bounds in our 
performance evaluation of the different architectures. 

In this approach, we can distinguish two different types of sources: Client Source and Server Source, 
named after the fact that this is indeed the classical client-server architecture. 






4.2.1 Client Source 

A Client Source (CS, for short) does not perform any one of the operations involved in query evaluation. It 
receives queries from local applications and simply sends them (via the ORQ) to a Server Source for 
evaluation; when the corresponding answers arrive (in the IRQ), the CS makes them available to the applications 
(in the AQ). 

The state machine presented in Figure 7 models the life-cycle of a QC object in a Client Source. The 
QC of a CS only contains objects corresponding to full queries, hence no total or partial objects. We 
can distinguish three types of events: a new query arrives; an answer arrives; and a time-out expires. The 
arrival of a query starts the life-cycle of a query. There may be three cases: 

- No object exists having as expression the incoming query; in this case, a new object is created and its 
state is set to free. 

- A closed object o having as expression the incoming query exists; in this case, the answer of o is used 
for answering immediately the incoming query, and no new object is created. 

- A free, principal or declined object o exists having as expression the incoming query; in this case, 
no evaluation needs to be launched for the incoming query, because a query with the same expression 
is being currently evaluated. Consequently, a new object is created, its state is set to dependent and 
its dependency is set to o. 

A free object is deleted when its answer time-out expires; in this case, the requesting application is 
notified that no answer for the query could be computed. Otherwise, a free object becomes: 

- closed when the corresponding answer arrives; or 

- principal if a dependency to it is created before the answer arrives. 

A principal object, corresponding to a query not yet answered but with other queries depending on it, 
becomes: 

- closed when the corresponding answer arrives; 

- declined when the answer time-out expires. In this case, the requesting application is notified that 
no answer for the query could be computed but the object is not destroyed because the answer to the 
corresponding query is required to answer the queries of the dependent objects. 

When the answer to a declined object arrives, the object becomes closed, and the just arrived answer is 
used for answering all queries dependent on it. The object is destroyed if all dependent objects expire (and 
are destroyed) before the answer arrives; this is the only way such an object dies; in other words, a declined 
object stays in the Cache as long as there is a query dependent on it. 

A dependent object can transition only into the final state, i.e. be destroyed. This can happen in two 
different ways: 

- if the answer time-out expires, the object is destroyed and a null answer is generated for it; 

- if the answer to the object on which it depends arrives, then that answer is used for it too. 

Finally, a closed object stays in the QC until its cache time-out expires. After that, it is destroyed (by 
the Cache Manager, see below). 
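
The life-cycle just described can be summarized as a transition table. The Python sketch below is an illustration only: the event names (`answer`, `dependency`, `answer_timeout`, `no_dependents`, `principal_answered`, `cache_timeout`) are hypothetical labels for the events described in the text, and `None` stands for the final state in which the object is destroyed.

```python
from typing import Optional

# (state, event) -> next state; None means the object is destroyed.
TRANSITIONS = {
    ("free", "answer"): "closed",
    ("free", "dependency"): "principal",
    ("free", "answer_timeout"): None,
    ("principal", "answer"): "closed",
    ("principal", "answer_timeout"): "declined",
    ("declined", "answer"): "closed",
    ("declined", "no_dependents"): None,
    ("dependent", "answer_timeout"): None,
    ("dependent", "principal_answered"): None,
    ("closed", "cache_timeout"): None,
}


def step(state: str, event: str) -> Optional[str]:
    """Apply one event to a QC object of a Client Source."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"event {event!r} is not expected in state {state!r}")
    return TRANSITIONS[key]


def state_for_new_query(existing_state: Optional[str]) -> Optional[str]:
    """State of the object created for an incoming query, given the state of a
    cached object with the same expression (None if there is no such object).
    Returns None when the query is answered from the cache and no object is created."""
    if existing_state is None:
        return "free"
    if existing_state == "closed":
        return None
    if existing_state in ("free", "principal", "declined"):
        return "dependent"
    raise ValueError(f"unexpected state {existing_state!r}")
```

For instance, a free object that acquires a dependent query and then times out goes through principal to declined, and is closed only if the answer eventually arrives.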

Figure 8: Client and Server Source Components for the CCD approach 

The Source Component of a CS (Figure 8, left) includes three main processes: 

- Client Query Processor, invoked by the IRQ Server upon arrival of a new query on the IRQ. It handles 
this event as described above. 

- Client Answer Processor, invoked by the IRQ Server upon arrival of an answer message on the IRQ. 

- Client Cache Manager, which periodically inspects the QC in order to manage the expired objects. 
Depending on the state of these objects, appropriate action is taken, as established by the query 
life-cycle. 

4.2.2 Server Source 

A Server Source (SS, Figure 8, right) receives queries either from the served Client Sources or from local 
applications (see Section 3); it evaluates these queries and sends the answers to the appropriate requesters 
by placing them either on the ORQ or on the AQ. 
Four types of events can occur on a SS: 

- A new query arrives. 

- A time-out expires. 

- An ask message arrives. 

- A tell message arrives. 

The state machine modeling the life-cycle of a QC object in a SS is a simple extension of that for a CS, and 
is not reported for brevity. The transitions generated by the arrival of a new query are the same as those 
seen for a CS. When the SS receives an ask message, a new object is generated, having as state: 

- total, if the involved term is a term in an original query; this can be ascertained by checking whether 
the set of the already visited terms has 2 elements; 

- partial, if the query is not total; this can be obviously ascertained by checking whether the set of 
already visited terms contains more than two terms. 

If the answer time-out of a total or of a partial object expires before the answer is received, a tell message 
with the special symbol e is generated; e is interpreted as an empty answer not to be cached. Notice that 
this preserves soundness of the query evaluation, while giving up completeness. 

Finally, when a non-e tell message arrives for a total object, the object becomes closed and the answer 
is stored to be used to answer future queries, until it expires. On the other hand, when the tell message 
relative to a partial object arrives, the object is destroyed. 
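
The treatment of total and partial objects on a Server Source can be sketched as two small functions. This is an illustration under assumptions: the sentinel `EMPTY_UNCACHED` stands for the special symbol e, and the function names and return conventions are hypothetical.

```python
# Sentinel standing for the special symbol e (empty answer, not to be cached).
EMPTY_UNCACHED = object()


def state_for_ask(visited):
    """An ask message concerns a term of an original query exactly when
    the set of already visited terms has 2 elements."""
    return "total" if len(visited) == 2 else "partial"


def on_tell(state, result):
    """Handle the tell message for a total or partial object.

    Returns (new_state, usable_result): new_state is "closed" when the answer
    is cached for re-use, None when the object is destroyed; usable_result is
    the set of objects passed on to the enclosing query program.
    """
    timed_out = result is EMPTY_UNCACHED
    usable = set() if timed_out else set(result)
    if state == "total" and not timed_out:
        return "closed", usable
    return None, usable
```

Mapping e to the empty set is what keeps the evaluation sound: a timed-out sub-query can only remove objects from the final answer, never add spurious ones.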

The Server Query Processor performs the operations of the Query procedure presented in the previous 
Section, reducing a full query to a term query, and launching the execution of the latter via an ask message. 

ask and tell messages are handled by the servers of the Ask and Tell Queues, respectively, as shown in 
Figure 8, right. The Ask Queue Server (AQS), presented in Figure 9, implements the Ask procedure 
of the previous Section. It dequeues the first message from the Ask-Queue. Notice that at this point an 
object with Query-ID attribute equal to ID has already been created, either by the Server Query Processor 
or by the AQS itself, but this object misses the proper values of the n and QP attributes, which can be 
computed only after analyzing the corresponding query. This analysis is carried out now. AQS then checks 
(line 3) whether the Cache contains a closed object l whose query expression is the same term t as the 
request being processed. If such an object exists, its answer is used to answer the current ask message, by 
creating a corresponding tell message and enqueuing it on the Tell-Queue (line 4). If such an object 
does not exist, AQS checks (line 5) whether the object corresponding to the query ID still exists. If not, 
this object has been destroyed because the answer time-out has expired, and in this case AQS does nothing. 
If the object is found (in the variable l), AQS examines the B-graph in order to find the hyperedges to be 
considered, i.e. those which pass the test performed by Qe. If no hyperedge is found, then n remains 0, the 
test on line 15 fails, and a tell message with I(t) is put (line 25) on the Tell-Queue. Otherwise (lines 7 to 
14), each hyperedge is processed by generating a new sub-query for each term ui in its tail. The information 
required to launch the execution of the sub-query (namely, its id, its term, and the set of visited terms) is 
temporarily stored on a local queue Q. The sub-program corresponding to the hyperedge is accumulated 
on the variable C, while QP and n store all generated sub-programs and the total number of open calls, 
respectively. These values are assigned to the QP and n attributes of l, thus completing the initialization of 
this object (line 16). Finally, AQS launches the evaluation of the generated sub-queries, in the loop on lines 
17-24. Until Q is empty, it dequeues the information for constructing an ask message for each sub-query, 
creating in the Cache a log object m representing the sub-query; m is initialized in lines 20-23 and finally 
the corresponding sub-query is asked by putting an ask message on the Ask-Queue. 

Ask-Queue-Server 

1.  until Ask-Queue = ∅ do 
2.      (ID, t, A) ← Dequeue(Ask-Queue); 
3.      if exists l ∈ Query-Cache such that exp[l] = t ∧ state[l] = closed then 
4.          Enqueue(Tell-Queue, (ID, answer[l])); 
5.      else if exists l ∈ Query-Cache such that Query-ID[l] = ID then 
6.          n ← 0; QP, Q ← ∅; 
7.          for each hyperedge h = ({u1, ..., ur}, t) such that {u1, ..., ur} ∩ A = ∅ do 
8.              C ← ∅; 
9.              for each ui do 
10.                 IDi ← (ID, New-Num); 
11.                 C ← C ∪ {IDi}; 
12.                 n ← n + 1; 
13.                 Enqueue(Q, (IDi, ui, A ∪ {ui})); 
14.             QP ← QP ∪ {C}; 
15.         if n > 0 then 
16.             n[l] ← n; QP[l] ← QP; 
17.             until Q = ∅ do 
18.                 (I, u, B) ← Dequeue(Q); 
19.                 m ← New-Query-Cache; 
20.                 Query-ID[m] ← I; exp[m] ← u; 
21.                 n[m] ← 0; QP[m], answer[m] ← ∅; dep[m] ← null; 
22.                 time-out[m] ← NOW + ta; 
23.                 if |B| = 2 then state[m] ← total else state[m] ← partial; 
24.                 Enqueue(Ask-Queue, (I, u, B)); 
25.         else Enqueue(Tell-Queue, (ID, I(t))); 

Figure 9: Ask Queue Server in CCD approach 

Tell-Queue-Server 

1.  until Tell-Queue = ∅ do 
2.      (ID, R) ← Dequeue(Tell-Queue); 
3.      if exists l ∈ Query-Cache such that Query-ID[l] = ID then 
4.          if state[l] = total and R ≠ e then do state[l] ← closed; answer[l] ← R; time-out[l] ← NOW + tc; 
5.          else delete(Query-Cache, l); 
6.          if R = e then R ← ∅; 
7.          let l1 ∈ Query-Cache such that ID occurs in QP[l1]; 
8.          QP1 ← Close(QP[l1], Query-ID[l1], R); 
9.          if n[l1] > 1 then do n[l1] ← n[l1] - 1; QP[l1] ← QP1; 
10.         else do 
11.             S ← Compute-answer(QP1); 
12.             if exp[l1] ∈ Tself then Enqueue(Tell-Queue, (Query-ID[l1], S ∪ I(exp[l1]))); 
13.             else do 
14.                 if state[l1] ≠ declined then 
15.                     if Extract-PID(ID) = self then Enqueue(Answer-Queue, (Query-ID[l1], S)); 
16.                     else Enqueue(Output-Request-Queue, (Query-ID[l1], S)); 
17.                 if state[l1] = principal or state[l1] = declined then 
                        for each l' ∈ Query-Cache such that dep[l'] = ID do 
18.                         if Extract-PID(Query-ID[l']) = self then Enqueue(Answer-Queue, (Query-ID[l'], S)); 
19.                         else Enqueue(Output-Request-Queue, (Query-ID[l'], S)); 
20.                         delete(Query-Cache, l') 
21.                 state[l1] ← closed; answer[l1] ← S; time-out[l1] ← NOW + tc. 

Figure 10: Tell Queue Server in CCD approach 

The Tell-Queue-Server (TQS) procedure is presented in Figure 10. First, a tell message is dequeued. 
At this stage, an object l with Query-ID equal to ID has been created in the QC (by AQS) and ID is also 
an open call in the query program of some other object l1. TQS manages both l and l1. In particular, TQS 
has to check whether the tell message being processed completes l1's evaluation, in which case TQS must 
issue the relative tell message. Moreover, if l1 is associated to an original query, TQS must properly handle 
the answer and the state of l1. 

TQS first checks whether l still exists in the QC. If it does not, then nothing is done. Otherwise, TQS 
checks whether the current answer can be re-used, which means that l's state is total and R is not the 
special symbol e. If this is indeed the case, l becomes closed and is properly updated (line 4); if not, l is 
destroyed (line 5). On line 6, the meaning of e as the empty answer is installed in R. Then, TQS retrieves 
l1 (line 7) and uses Close to modify the query program QP in it, by closing the open call QID: this means 
replacing QID by R, obtaining a new query program QP1 (line 8). Then, the number of open calls of 
l1 is tested: if there are still open calls, l1's evaluation is not complete, thus its n and QP attributes are 
updated and the procedure terminates. If the test on line 9 fails, i.e. n = 1, then the evaluation of the 
query corresponding to l1 is completed, therefore the result is computed in S and it is tested whether the 
query term is in the terminology of the source (line 12). If so, QID1 identifies a sub-query, therefore the obtained 
result is stored on a tell message which is enqueued on the Tell Queue. If the query term of l1 is not in 
the terminology of the source, then it is a dummy term, hence QID1 is the id of an original query q whose 
evaluation has just been completed. In this case TQS must also manage the object l1. If the state of l1 is 
declined, then the answer time-out of this object has expired, thus the just computed answer is no longer 
usable for its query. In all the other cases, the answer must be returned, either locally (if PID is the id of the 
present source, line 15) or remotely (line 16). If the state of l1 is principal or declined, then other queries 
are depending on l1. Each object l' relative to one dependent query is identified (line 17) and deleted (line 
20) after the corresponding answer is output either in the AQ (if local, line 18) or in the ORQ (if remote, 
line 19). Finally, l1 is updated (line 21) to be subsequently re-used, until it expires. 

4.3 Query evaluation in the DDD Approach 

From an architectural point of view, DDD is the pure P2P approach, in which all sources are of the same 
kind, each one storing its own taxonomy and associated interpretation. 

A DDD Source receives queries on its terminology from local applications, or ask messages from other 
sources which, due to articulations, need to evaluate some sub-query on the local terminology. The Source 
Component carries out the evaluation of these queries or ask messages by relying on its taxonomy and 
interpretation. Whenever it requires an answer to a sub-query outside its terminology, it asks the appropriate 
source, from which it receives the corresponding answer in a tell message. 

The life-cycle of a query in a DDD Source is identical to that in a CCD Server Source. 

A DDD Source Component includes 4 main processes: Query Processor, Ask Processor, Tell Processor, 
and Cache Manager. 

Query Processor Query Processor (QP for short) is invoked by the Input Request Queue Server whenever 
a Query message is dequeued having as fields (ID, q). It performs the following operations: 

- If no query exists in the Cache with expression equal to q, then QP behaves like a CCD Server: it reduces 
q to a term query t, creates a new free query in the Cache and puts an ask message into the Input 
Request Queue, in order to launch the evaluation of t. 

- If there exists a closed query in the Cache with expression equal to q, then QP uses the answer to 
that query to create an answer message for the input query, which it then puts in the Answer Queue. 

- If there exists a query q' with expression equal to the input query and its state is neither dependent 
nor closed, then QP behaves like a CCD Client: it creates a new query in the Cache, dependent on q', 
and if q' is free, QP sets it to principal. 

Ask Processor The DDD Ask Processor (AP) behaves like the CCD Ask Queue Server, except that the 
ask messages regarding terms belonging to other sources' terminologies are inserted into the Output Request 
Queue (hence into the network) rather than into the Ask Queue. As already pointed out, the ask messages 
posted on the ORQ will end up in the IRQ of the receiving source, from where the IRQ Server dequeues 
them, and uses their content to invoke the local AP. 

Tell Processor The DDD Tell Processor (TP) behaves like the CCD Tell Queue Server, except that the 
messages regarding the evaluation of sub-queries requested by other sources are inserted into the Output 
Request Queue (hence into the network). Through the previously described path, these messages will be 
processed by the TP of the receiving Source. 

4.4 Query evaluation in the CDR Approach 

In a CDR architecture, the taxonomy is centralized but every source has its own interpretation concerning 
the local terminology. This is a hybrid P2P approach, which applies, for instance, when a central authority 
controls the vocabulary used by a community of speakers for indexing their objects. As a consequence, there 
exist two types of sources: Source and Server Taxonomy Source. 

4.4.1 CDR Source 

The CDR Source component consists of 5 main processes: Query Processor, Query Program Processor, 
Answer Processor, Local Interpretation Processor, and Cache Manager. 

The Query Processor receives queries from local applications. It sends each query to a server taxonomy 
source to obtain the re-write of the query. When the re-written query arrives, it is handled by the Query 
Program Processor, which evaluates it by retrieving the interpretation for local terms, while asking the 
appropriate sources for the interpretation of the foreign terms. These requests are handled by the Local 
Interpretation Server of the involved Sources. The answers to these requests are handled by the Answer 
Processor, which eventually computes the query answer and makes it available to the requesting application. 

The life-cycle of a QC object in a CDR Source extends that of a CCD Client Source, in order to manage 
the event of the arrival of a query re-write, performed by the Query Program Processor as described below. 

Query Program Processor When a message containing a query re-write (ZD, rw) arrives, the Query 
Program Processor (QPP) checks whether the Cache contains an object I whose Query-ID attribute value 
is ID. If not, the object has expired and has been destroyed, so nothing is done. If yes, QPP performs the 
following operations: 

- if the state of I is free or principal, it changes it to free-rw or principal-rw, respectively; if the 
state of I is declined, the object remains in the same state. In all these cases, the QP attribute is 
updated with the re-write, thereby caching the re-write for re-use. 

- it launches the evaluation of rw as follows: 

— it groups the terms in rw by the source they belong to; 

— it retrieves the interpretations of the local terms and inserts them into an answer message; 

— it requests interpretations of the foreign terms by sending a message to the appropriate sources. 

The grouping is very important, because it avoids requesting the same source more than once. 
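
The grouping step can be illustrated with a minimal Python sketch. It is not the authors' implementation: representing a term as a (source, name) pair and the names group_terms_by_source, plan_evaluation and local_interpretation are assumptions made only for this example.

```python
from collections import defaultdict


def group_terms_by_source(terms):
    """Group (source_id, term_name) pairs by source_id (hypothetical term encoding)."""
    groups = defaultdict(set)
    for source_id, name in terms:
        groups[source_id].add(name)
    return dict(groups)


def plan_evaluation(rewrite_terms, local_id, local_interpretation):
    """Split the terms of a re-write into a local answer and one request per foreign source.

    local_interpretation maps a local term name to its set of objects (assumed layout).
    """
    groups = group_terms_by_source(rewrite_terms)
    local_names = groups.pop(local_id, set())
    # Local terms are resolved immediately and go into the answer message.
    answer = {(local_id, n): set(local_interpretation.get(n, ())) for n in local_names}
    # Exactly one request per foreign source, whatever the number of its terms.
    requests = {src: sorted(names) for src, names in groups.items()}
    return answer, requests
```

Because the requests are keyed by source, a source that owns several terms of the re-write still receives a single message.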

Query Processor The Query Processor behaves like the Query Processor in the CCD Client Source, 
except that when the Cache contains an object I with the same expression as the received one and whose 
state is not closed or dependent, QP creates a new dependent query object that depends on I and if the 
state of I is free or free-rw, QP changes it to principal or principal-rw, respectively. 
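
The state changes described for the Query Program Processor and the Query Processor can be summarized as two small transition tables. The Python sketch below is only a reading of the prose: the function names are hypothetical, and the handling of closed or dependent objects (which follows the CCD Client Source) is deliberately not modelled.

```python
# State names are taken from the text; the tables are an interpretation of the prose.
ON_REWRITE_ARRIVAL = {
    "free": "free-rw",
    "principal": "principal-rw",
    "declined": "declined",  # a declined object keeps its state
}
ON_DEPENDENT_CREATED = {
    "free": "principal",
    "free-rw": "principal-rw",
}


def on_rewrite_arrival(state):
    """New state of a cached query object when its re-write arrives (QPP)."""
    return ON_REWRITE_ARRIVAL.get(state, state)


def on_duplicate_query(state):
    """Return (new_state, dependent_created) when a query with the same expression arrives (QP).

    For closed or dependent objects no dependent object is created here; that case
    is handled as in the CCD Client Source and is outside this sketch.
    """
    if state in ("closed", "dependent"):
        return state, False
    return ON_DEPENDENT_CREATED.get(state, state), True
```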

Answer Processor When an answer message (ID, A) arrives containing the previously requested 
interpretation(s) of foreign term(s), the Answer Processor (AP) checks whether the QC contains an object I 
whose Query-ID attribute value is ID. If not, the object has expired and has been destroyed; in this case, 
AP does nothing. If I is found in the QC, then the sets of objects contained in the answer message are used 
to replace the corresponding terms in the QP of I; if every term in QP has been replaced, then the answer 
to the query is computed; moreover, the time-out attribute of I is updated to cache time-out and the query 
object is closed. In addition: 

- if the query is free-rw, the answer is output by putting a message in the Answer Queue; 

- if the query is principal-rw, the answer is output for the present and for all the depending queries; 

- if the query is declined, an answer is output only for all depending queries. 
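
A minimal Python sketch of this behaviour follows. It assumes, for illustration only, that a re-write is a disjunction of non-empty conjunctions of terms, so that the answer is a union of intersections of term extensions; the class QueryObject, the function names and the Cache being a plain dictionary are all hypothetical, and the time-out update is omitted.

```python
class QueryObject:
    """Hypothetical cached query object; only the fields needed here."""

    def __init__(self, query_id, rewrite, state):
        self.query_id = query_id
        self.rewrite = rewrite    # list of conjunctions, each a non-empty list of terms
        self.state = state
        self.extensions = {}      # term -> set of objects received so far

    def pending_terms(self):
        all_terms = {t for conj in self.rewrite for t in conj}
        return all_terms - set(self.extensions)


def output_targets(state):
    """Who receives the answer, depending on the state of the query object."""
    return {
        "free-rw": ("self",),
        "principal-rw": ("self", "dependents"),
        "declined": ("dependents",),
    }.get(state, ())


def handle_answer(cache, query_id, interpretations):
    """Process an answer message (ID, A); return (answer, targets) once complete."""
    obj = cache.get(query_id)
    if obj is None:
        return None               # object expired and destroyed: do nothing
    obj.extensions.update({t: set(o) for t, o in interpretations.items()})
    if obj.pending_terms():
        return None               # still waiting for other sources
    answer = set()
    for conj in obj.rewrite:
        answer |= set.intersection(*(obj.extensions[t] for t in conj))
    targets = output_targets(obj.state)
    obj.state = "closed"
    return answer, targets
```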

Local Interpretation Server The Local Interpretation Server is invoked when a message arrives requesting the 
interpretation of a set of terms belonging to the local terminology. It retrieves the interpretation of 
every requested term, puts the result in an answer message, and places the message into the Output Request 
Queue. 

4.4.2 Server Taxonomy Source 

A Server Taxonomy Source (STS) carries out three basic tasks: the re-write of all queries that it receives; 
the evaluation of the queries that it receives from local applications; and the evaluation of the terms in 
its terminology requested by remote sources. The re-writing stage is carried out locally as described in 
Section 3.3, which means that all ask and tell messages are local. Instead, the second stage of query 
evaluation implies the exchange of ask and tell messages with the sources holding the foreign terms, and 
this time ask and tell messages are as described in Section 3.1. 

The life-cycle of a QC object in a Server Taxonomy Source extends that of the CCD Server Source with 
the management of the event concerning the arrival of a re-write request for a query. During this stage, 
partial or total QC objects are generated and evolved accordingly. When the final tell message arrives 
for a total query object, the object becomes free-rw and its QP value is saved to be re-used for re-writing 
queries with the same expression. On the other hand, when the final tell message to a partial query 
object arrives, the object is destroyed. 

Server Taxonomy Query Processor The Server Taxonomy Query Processor (STQP) is invoked when 
a message requesting to re-write a query q arrives from a local application or from a remote source. STQP 
initiates the query re-write by placing an appropriate ask message onto the Ask Queue, similarly to the 
Query Processor in the DDD approach, with the following difference: when it creates a new dependent 
object in the Cache, if the state of the referred object is free or free-rw, STQP changes it to principal 
or principal-rw, respectively. 

Ask Queue Server and Tell Queue Server These Servers behave in the same way as in the CCD 
architecture, except that tell messages contain linearizations of a re-write, as explained in Section 3.3. 

Answer Processor The Answer Processor (AP) in a STS performs the job of the Answer Processor and 
the Query Program Processor in a CDR Source. Thus, an AP receives either a query re-write to evaluate, 
or the interpretation of a set of foreign terms. 

4.5 Query evaluation in the DCR Approach 

This approach is dual to CDR. It includes two types of sources: Source and Server Interpretation Source. 
The life-cycle of a query in both types of sources is identical to that of the CDR Server Taxonomy Source. 

DCR Source A DCR Source does not have any interpretation; it only has the taxonomy of the terms 
in its own terminology and articulations. When it receives queries from local applications, it cooperates 
with other sources in order to carry out the re-writing stage. In particular, it sends ask messages for the 
re-writing of foreign terms, and receives ask messages for the re-writing of its own terms; tell messages 
flow correspondingly. When the re-writing stage is completed, the source sends the obtained query re-write 
to a Server Interpretation Source for evaluation. 

Server Interpretation Source A Server Interpretation Source (SIS) has its own taxonomy and the whole 
network interpretation. It carries out the re-writing of queries in cooperation with the other sources, and in 
addition evaluates query re-writes. 

4.6 Query evaluation in the DDR Approach 

Similarly to the DDD approach, there is only one type of Source in DDR. A DDR Source receives queries 
from local applications, and answer, ask, tell, and interpretation request messages from other sources. 
The Source carries out both the first and the second stage of the re-write based query evaluation method in 
cooperation with the other sources. 

5 Performance Evaluation 

In order to evaluate and compare the algorithms described in the previous Section from a performance point 
of view, a simulation experiment has been run for each of the 5 methods, using the same underlying network 
and under the same query flow. The results of this experiment are summarized in Table 4, which also 
indicates how long it took to obtain a stable average response time in each case. The rest of this Section is 
devoted to a description of the way these results have been obtained, and a discussion on their meaning. 

Method   Time to stabilize   Avg response time      St. dev. response time
         (minutes)           per retrieved object   per retrieved object
                             (milliseconds)         (milliseconds)

CCD      515                 21.512                 1.472
CDR      445                 25.301                 0.139
DCR      660                 32.799                 0.313
DDR      650                 34.336                 0.251
DDD      660                 42.423                 1.132

Table 4: Performance evaluation of the 5 methods 



5.1 The Network Model 

The models of Source used for the simulation are exactly the same as those presented in the previous Section. 
Thus, all types of Sources are structured as illustrated in Figure 6, and differ from one another in the Source 
Component, which consists of the specific processes that have been illustrated in the previous Section, for 
each approach. 

Figure 11: Statistical distribution of captured Gnutella peers 



The data for configuring the underlying communication network, as well as other important parameters, 
have been taken from statistical investigations on the Internet, carried out at the University of Washington 
in the context of studies on peer-to-peer file sharing [SOlIJ^ (see Figure 11). This information has been used 
to estimate the delay of operations performed by the TCP protocol and also the statistical distribution of 
queries over time. The total delay is structured as follows: 

- Queue Delay, depending on the degree of congestion of the network and the size of the involved Queue; 

- Processing Delay, given by the time required for: creating messages and decomposing/recomposing 
them into/from packets, and local query execution. It has been assumed that it takes 10"^ seconds to 
process one packet, and that a disk access takes 6 milliseconds (ms). 

- Packet Transmission Delay, proportional to the size of the packet and the bandwidth; 

- Propagation Delay, that is, the network latency. 

In order to measure the network latency we have used the estimations done on the Gnutella network, given 
by the RTT (round-trip time) of a 40-byte TCP packet exchanged between a peer and the measurement 
host. In particular, we have used the following distribution: 

- 20% of peers have a latency smaller than 70 ms; 

- 20% of peers have a latency higher than 280 ms; 

- the remaining 60% of peers have latency uniformly distributed in between 70 and 280 ms. 
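
As an illustration, this latency distribution can be sampled as in the Python sketch below. The text fixes only the 20%/60%/20% split and the 70 ms and 280 ms boundaries; drawing the two tails uniformly, and the bounds low_floor and high_cap, are assumptions introduced here, as is the function name.

```python
import random


def sample_latency_ms(rng, low_floor=1.0, high_cap=1000.0):
    """Draw one peer latency in milliseconds.

    20% of peers fall below 70 ms, 20% above 280 ms, and 60% are uniform
    between 70 and 280 ms. The shape of the two tails is NOT given in the
    text: uniform tails bounded by low_floor and high_cap are assumed.
    """
    u = rng.random()
    if u < 0.2:
        return rng.uniform(low_floor, 70.0)
    if u < 0.4:
        return rng.uniform(280.0, high_cap)
    return rng.uniform(70.0, 280.0)
```

Passing an explicitly seeded random.Random instance keeps the draw reproducible across experiments.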

To estimate the network bandwidth we have also relied on the measurements done on Gnutella, according 
to which 78% of the users are connected on a large bandwidth (Cable, DSL, T1 or T3); according to the 
same estimation, about 30% of the users have a connection bandwidth higher than 3Mbps. 

The size of the network was chosen to be 11400 sources, which is the maximum number of simultaneously 
connected peers in Gnutella over a continuous period of 192 hours. 

The number of servers utilized for CCD, CDR and DCR approaches is 57, that is 0.5% of the total 
number of sources. Clearly, all the servers store the same network taxonomy or interpretation (or both). 

In order to obtain the taxonomy, interpretation, and articulations of each source, the following parameters 
have been used, all with a uniform distribution: 

- the size of the terminology is between 1 and 500 terms; 

- the size of the interpretation of any term is between 1 and 100 objects; 

- every source is articulated with 1 to 4 other sources; 

- the size of a local taxonomy is 25% the size of the corresponding terminology; 

- the total number of articulations is 6% of the size of the network terminology. 
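
A sketch of how the per-source parameters could be drawn is given below in Python. It only generates sizes and neighbour sets; the function name, the term naming scheme and the returned dictionary layout are hypothetical. The last parameter (the total number of articulations) is a network-wide quantity and is therefore not computed per source.

```python
import random


def generate_source(rng, source_id, network_size):
    """Draw the per-source parameters, each uniformly at random.

    Assumes network_size >= 5 so that up to 4 distinct neighbours exist.
    """
    n_terms = rng.randint(1, 500)                        # terminology size
    terms = [f"s{source_id}.t{i}" for i in range(n_terms)]
    # Number of objects in the interpretation of each term.
    interpretation_sizes = {t: rng.randint(1, 100) for t in terms}
    # Sources this source is articulated with (1 to 4, never itself).
    n_neighbours = rng.randint(1, 4)
    neighbours = set()
    while len(neighbours) < n_neighbours:
        candidate = rng.randint(1, network_size)
        if candidate != source_id:
            neighbours.add(candidate)
    return {
        "id": source_id,
        "terms": terms,
        "interpretation_sizes": interpretation_sizes,
        "neighbours": neighbours,
        # Local taxonomy size: 25% of the terminology size.
        "taxonomy_size": round(0.25 * n_terms),
    }
```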

We have assumed that objects are URLs. This is typical in P2P networks. Following [50], the average 
size of a URL is 63.4 bytes. The size of the internal representation of a term is assumed to be the minimum 
amount of space required to uniquely identify an object within a set (the terminology where the term belongs) 
on a network of sources, each identified by an IP number. Finally, the values of the time-out are as follows: 
answer time-out (the time a source waits for an answer to arrive) is 60 seconds, while the cache time-out 
(the time a source keeps an answer in the Cache for re-use) is 600 seconds. 

Most of these parameters are configurable. 

5.2 The simulation experiments 

In each experiment, the following variables have been measured: 

- the time required to evaluate a query; 

- the number of evaluated queries, required to compute the average response time; 

- the size of the query result; 

- the number of packets transmitted, including the packets exchanged for the ping-pong protocol, which 
has been used for inter-source communication; 

- the number of visited sources. 

For every approach, we have run a simulation experiment for the amount of time required to obtain a 
stable value for the response time. Each query is characterized by the following attributes: 

- an integer number that identifies the query; 

- the id of the source that formulates the query, randomly chosen in the interval [1, 11400]; 

- a set of terms randomly chosen from the terminology of the selected source, to be understood as a 
conjunction; 

- the time at which the query is issued. 

The query distribution is obtained from the statistical distribution of connected peers reported in Figure 11. 
In particular, the number of queries per unit of time is directly proportional to the number of connected peers. 
Since the same random generators have been used in all experiments for generating the query parameters, 
the same query distribution is used in each experiment. 

In order to reduce the size of the required storage, we have gathered statistics for periods of 5 minutes. 
The average response time was divided by the size of the result; it was considered to be stable when the 
difference between the values for 3 consecutive periods of 5 minutes, equivalent to 15 minutes of simulated 
time, was less than 10"^ milliseconds. The goodness of this choice is confirmed by the low value of the 
standard deviation for all evaluated methods. 
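
The stopping rule can be sketched in Python as follows. Reading "the difference between the values for 3 consecutive periods" as the spread (maximum minus minimum) of the last three 5-minute averages is an interpretation; the threshold is left as the parameter epsilon, and both function names are hypothetical.

```python
def normalized_response_time(total_response_time_ms, result_size):
    """Response time per retrieved object (milliseconds per object)."""
    return total_response_time_ms / result_size


def is_stable(period_averages, epsilon, window=3):
    """True when the last `window` period averages differ by less than epsilon.

    period_averages holds one normalized average per 5-minute period, in order.
    """
    if len(period_averages) < window:
        return False
    recent = period_averages[-window:]
    return max(recent) - min(recent) < epsilon
```

With window=3 and 5-minute periods, the check spans 15 minutes of simulated time, as in the text.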

5.3 Results and discussion 

Figure 12 details the distribution of the average response time over time for each method. As this Figure and 
Table 4 show, the fastest method is, not surprisingly, CCD, that is direct evaluation when both taxonomy and 
interpretation are centralized. Perhaps more surprisingly, the next best method is CDR, for reasons that will 
be analyzed below. As expected, the worst method is DDD, which does not allow any type of optimization, 
and at each step of the execution algorithm sends all collected terms and objects. The performance of the 
DCR and DDR methods tends to be the same, the former being slightly better because it can count on a 
centralized interpretation. 
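
The ranking can be recomputed directly from the figures of Table 4. The short Python sketch below does so and also derives each method's average response time relative to CCD; the variable names are chosen only for this example.

```python
# Table 4: method -> (minutes to stabilize, avg ms per object, st. dev. ms per object)
TABLE_4 = {
    "CCD": (515, 21.512, 1.472),
    "CDR": (445, 25.301, 0.139),
    "DCR": (660, 32.799, 0.313),
    "DDR": (650, 34.336, 0.251),
    "DDD": (660, 42.423, 1.132),
}

# Rank methods by average response time per retrieved object (fastest first).
ranking = sorted(TABLE_4, key=lambda m: TABLE_4[m][1])

# Average response time of each method relative to the fastest one.
baseline = TABLE_4[ranking[0]][1]
relative = {m: TABLE_4[m][1] / baseline for m in ranking}
```

On these figures DDD takes roughly twice as long per retrieved object as CCD, while CDR is within about 18% of it.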

Needless to say, the actual values of the measured variables are determined also by the parameters 
chosen to configure the simulator, detailed above. What really matters are the relative values between the 
5 compared methods, or in other words, the ranking of the methods produced by the experiment. 

In order to verify the correctness of the results, for every pair of measured variables the correlation 
coefficient has been computed. Correlation has been confirmed between: 

- number of generated sub-queries and the number of exchanged messages, 

- number of exchanged messages and number of visited sources. 

In contrast, the number of exchanged messages and the size of the answer are independent. 
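
A pairwise check of this kind can be sketched as below. The text does not say which correlation coefficient was used; the Pearson coefficient is assumed here, and the function names and the layout of the input (one list of per-query measurements per variable) are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def pairwise_correlations(variables):
    """Map every unordered pair of variable names to its correlation coefficient."""
    return {
        (a, b): pearson(variables[a], variables[b])
        for a, b in combinations(sorted(variables), 2)
    }
```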

The average number of sources visited for evaluating one query is reported in Figure 13. The queries that 
either addressed the terminologies of sources with no articulations or could be answered using the Cache 
were evaluated locally, and therefore considered to visit no sources. This explains why the numbers in 
Figure 13 are so low, CCD being almost everywhere below 1. As can be seen, re-writing methods are those 
involving a higher number of visits, with DDR having the highest for obvious reasons. Not surprisingly, CCD 
is the method requiring the least number of visits of all. However, the graph suggests another clustering: 
the methods in which the taxonomy is distributed have similar, very irregular curves, whilst those in which 
the taxonomy is centralized have similar, much more regular curves. This explains why CDR is the second 
best method after CCD: the centralization of the taxonomy allows the query to be re-written by contacting at 
most one source, the taxonomy server; if instead the taxonomy is distributed, several sources may need to be 
contacted in the re-writing stage, with the possibility that the same source be contacted more than once, if 
articulations require so. The distribution of the interpretation may require contacting several sources for the 
second stage, but the fact that this second stage is optimized implies that every involved source is contacted 
exactly once, and this makes this step affordable, so that on average 2 sources need to be visited, with a very 
small standard deviation (in fact, for CDR the average number of visited sources is 1.887, and the standard 
deviation is 0.053). 

This is confirmed by the number of exchanged messages (Figure 14). As expected, the methods in which 
the taxonomy is distributed are those which require more message exchanges, with DDR being the worst 
due to the fact that interpretations are also distributed and the re-writing approach is followed. Also in this 
case CDR, although very similar to DDR, does much better, up to being not significantly different from the 
best method CCD. 

Thus we can conclude that CDR is superior to all other distributed approaches because the optimization 
taking place between the 2 stages of the re-writing compensates for the fact that more than one source needs to 
be contacted due to the distribution of the interpretations. In other words, distributing the taxonomy affects 
the performance of the method in a significant way, whilst optimization can compensate for the distribution of 
the interpretations. 

6 Related work 

In this Section we relate our work with the literature on peer-to-peer systems and data integration, with 
emphasis on the former. Some parts of the work reported in this paper have been already published. 
Namely, [36] presents a first model of a network of articulated sources, while [35] studies query evaluation 
on taxonomies including only term-to-term subsumption relationships. Finally, [26] introduces QE (without 
proving its soundness and completeness), and gives hardness results for language extensions. 

Description of our work in relation with P2P systems A peer-to-peer (P2P) system is a distributed 
system in which participants (the peers) rely on one another for service, rather than solely relying on 
dedicated and often centralized servers. The most popular P2P systems have focused on specific application 
domains like music file sharing ([3], [1], [2]) or on providing file-system-like capabilities [9]. In most cases, 
these systems do not provide semantic-based retrieval services, as the name of an object (e.g. the title of a 
music file) is the only means for describing the object contents. 

In our work, we make a distinction between the logical model of a network (presented in Section 2) and 
the architecture for implementing this model, which may be considered as a physical model of the network. 
Typically, this distinction is not made in the literature on P2P systems, thus we can compare only our pure 
P2P architecture, i.e. DDD, with the P2P literature. 

In the DDD approach, in order to evaluate a query q posed to a peer S, the incoming query (which is 
always expressed over its own taxonomy) is propagated only to those peers to which S has an articulation, 
and which can therefore contribute to the answer of the query (the latter is determined by the taxonomy 
and the articulations of S). Specifically, it is not the original query to be propagated, but a set of term 
sub-queries, each one belonging to the terminology of the recipient peer. Note that in DDD there is not 
any form of centralized index (like in Napster [3]), nor any flooding of queries (like in Gnutella [1]), nor any 
form of partitioned global index (like in Chord [33] and CAN [28]). Instead, we have a query propagation 
mechanism that is query- and articulation-dependent (note that Semantic Overlay Networks [13] is a very 
simplistic approach to this). Moreover note that the peers of our DDD model are quite autonomous in the 
sense that they do not have to share or publish their stored objects, taxonomies or mappings with the rest 
of the peers (neither to one central server, nor to the on-line peers). To participate in the network, a peer 
just has to answer the incoming queries by using its local base, and to propagate queries to those peers that 
according to its "knowledge" (i.e. taxonomy + articulations) may contribute to the evaluation of the query. 
However both of the above tasks are optional and at the "will" of the peer. 

From a data modeling point of view several approaches for P2P systems have been proposed recently, 
including relational-based approaches [8], XML-based approaches [22] and RDF-based approaches [27]. In this paper 
we consider the fully heterogeneous conceptual model approach (where each peer can have its own schema), 
with the only restriction that each conceptual model is represented as a taxonomy. A taxonomy can range 
from a simple tree-structured hierarchy of terms, to the concept lattice derived by Formal Concept Analysis 
[19], or to the concept lattice of a Description Logics theory. This taxonomy-based conceptual modeling 
approach has three main advantages (for more see [36]): (a) it is very easy to create the conceptual model 
of a source, (b) the integration of information from multiple sources can be done easily, and (c) automatic 
articulation using data-driven methods (like the one presented in [53]) is possible. 

Recently, there have been several works on P2P systems endowed with logic-based models of the peers' 
information bases and of the mappings relating them (called P2P mappings). These works can be classified 
in 2 broad categories: (1) those assuming propositional or Horn clauses as representation language or as a 
computational framework, and (2) those based on more powerful formalisms. With respect to the former 
category (e.g., see [S] [3 [5]), our work makes an important contribution, by providing a much simpler 
algorithm for performing query answering than those based on resolution. Indeed, we do rely on the theory 
of propositional Horn clauses, but only for proving the correctness of our algorithm. For implementing query 
evaluation, we devise an algorithm that avoids the (unnecessary) algorithmic complications that plague the 
methods based on resolution. As an example, after appropriate transformations our framework can be seen 
as a special case of that in [7]. Then, query evaluation can be performed by first computing the prime 
implicates of the negation of each term in the query, using the resolution-based algorithms presented in [7] . 
As the complexity of this problem is exponential w.r.t. the size of the taxonomy and polynomial w.r.t. the 
size of Obj, there is no computational gain in using this approach. Instead, there is an algorithmic loss, since 
the method is much more complicated than ours. 

As for the second category above, works in this area have focused on providing highly expressive knowledge 
representation languages in order to capture at once the widest range of applications. Notably, [12] proposes 
a model allowing, among other things, for existential quantification both in the bodies and in the heads of the 
mapping rules. Inevitably, such languages pose computational problems: deciding membership of a tuple in 
the answer of a query is undecidable in the framework proposed by [12], while disjunction in the rules' heads 
makes the same problem coNP-hard already for datalog with unary predicates (i.e. terms). These problems 
are circumvented in both approaches by changing the semantics of a P2P network, in particular by adopting 
an epistemic reading of mappings. As a result, the inferential relation of the resulting logic is weakened 
up to the point of making the above mentioned decision problem solvable in polynomial time. In contrast, 
our approach aims at reaching the same goal (efficient support for P2P information access), but from a 
different standpoint: in particular, we achieve efficiency of the network query evaluation by limiting the 
expressiveness of the language for representing mappings (articulations, in our terminology), while retaining 
a classical Tarskian semantics of these mappings (seen as logical formulae). In other words, we aim at a 
smaller class of applications, but for this we offer a framework resting on the classical logical foundations. 
The complementarity of the two approaches is therefore evident. Since we have also shown [26], [35] that 
classical semantics leads to intractability as soon as the expressiveness of the mapping language is increased, 
we can say that we have covered a large part of our side of the trade-off. 

In [10], a query answering algorithm for simple P2P systems is presented where each peer S is associated 
with a local database, an (exported) peer schema, and a set of local mapping rules from the schema of 
the local database to the peer schema. P2P mapping rules are of the form cq1 → cq2, where cq1, cq2 are 
conjunctive queries of the same arity n ≥ 1 (possibly involving existential variables), expressed over the union 
of the schemata of the peers, and over the schema of a single peer, respectively. Note that this representation 
framework partially subsumes our network source framework, since in our case cq1, cq2 are of arity 1, cq1 is a 
conjunctive query of the form u1(x) ∧ ... ∧ ur(x) over the terminology of a single peer, and cq2 is a single atom 
query t(x) over the terminology of the peer that the mapping (articulation) belongs to. However, simple 
P2P systems cannot express the taxonomy ⪯_S of our framework, which is local to a peer S. Query answering in 
simple P2P systems according to the first-order logic (FOL) semantics is in general undecidable. Therefore, 
the authors adopt a new semantics based on epistemic logic in order to get decidability for query answering. 
Notably, the FOL semantics and epistemic logic semantics for our framework coincide. In particular, in [10], 
a centralized bottom-up algorithm is presented which essentially constructs a finite database RDB which 
constitutes a "representative" of all the epistemic models of the P2P system. The answers to a conjunctive 
query q are the answers of q w.r.t. RDB. However, though this algorithm has polynomial time complexity, 
it is centralized and it suffers from the drawbacks of bottom-up computation that does not take into account 
the structure of the query. 

The work in [10] is extended in [12], where a more general framework for P2P systems is considered, 
which fully subsumes our framework and whose semantics is based on epistemic logic. In particular, in [12], 
a peer is also associated with a set of (function-free) FOL formulas over the schema of the peer. A top-down 
distributed query answering algorithm is presented which is based on synchronous messaging. Essentially, 
the algorithm returns a datalog program to the peer where the original query is posed, by transferring the 
full extensions of the peer source predicates relevant to the query along the paths of peers involved in query 
processing. The returned datalog program is used for providing the answers to the query. Obviously, our 
algorithm has computational advantages w.r.t. the algorithm in [12], since during query evaluation only the 
full or partial answer to a term (sub)query is transferred to the peer that posed the (sub)query, and not the 
full extensions of all terms involved in its evaluation. 

^Note that P2P mapping rules of this kind can accommodate both GAV and LAV-style mappings, and are referred to in the 
literature as GLAV mappings. 

^Recall that this restriction can be easily relaxed. 

The framework in [31] extends our framework by considering (i) n-ary (instead of unary) predicates 
(i.e. P2P mappings are general datalog rules) and (ii) a set of domain relations (also suggested in [32]), 
mapping the objects of one peer to the objects of another peer. A distributed query answering algorithm 
is presented based on synchronous messaging. However, the algorithm will perform poorly in our restricted 
framework, since when a peer receives a (sub)query, it iterates through the relevant P2P mappings and 
for each one of them, sends a (sub)query to the appropriate peer (waiting for its answer), until fix-point is 
reached. In our case, when a peer receives a (sub)query, each relevant P2P mapping is considered just once 
and no iteration until fix-point is required. 

A P2P framework similar to [10] is presented in [23], where query answering according to FOL semantics 
is investigated. Since in general, query answering is undecidable, the authors present a centralized algorithm 
(employed in the Piazza system [21]), which however is complete (the algorithm is always sound) only for 
the case that polynomial time complexity in query answering can be achieved. This includes the condition 
that inclusion P2P mappings are acyclic. However, such a condition severely restricts the modularity of 
the system. Note that our algorithm is sound and complete even in the case that there are cycles in the 
term dependency path and it always terminates. Thus, our framework allows placing articulations between 
peers without further checks. This is quite important, because the actual interconnections are not under the 
control of any actor in the system. 

In [17, 16], the authors consider a framework where each peer is associated with a relational database, 
and P2P mapping rules contain conjunctive queries in both the head and the body of the rule (possibly with 
existential variables), each expressed over the alphabet of a single peer. Again the semantics of the system 
is defined based on epistemic logic [15]. In these papers, a peer database update algorithm is provided 
allowing for subsequent peer queries to be answered locally without fetching data from other nodes at query 
time. The algorithm (which is based on asynchronous messaging) starts at the peer which sends queries to 
all neighbour peers according to the involved mapping rules. When a peer receives a query, the query is 
processed locally by the peer itself using its own data. This first answer is immediately replied back to the 
node which issued the query and sub-queries are propagated similarly to all neighbour peers. When a peer 
receives an answer, (i) it stores the answer locally, (ii) it materializes the view represented in the head of 
the involved mapping rule, and (iii) it propagates the result to the peer that issued the (sub)query. Answer 
propagation stops when no new answer tuples are coming to the peer through any dependency path, that 
is until fix-point is reached. In our case, the database update problem for a peer S amounts to invoking 
S' : QUERY(q) for each articulation q ⪯ t from S to another peer S' and storing the answer locally to S. Note 
that our query answering algorithm is also based on asynchronous messaging. However, since it considers a 
limited framework, it is much simpler and no computation until fix-point is required. In particular, for each 
term (sub)query issued to a peer through ask, only one answer is returned through tell. 

Relation with information integration The literature about information integration distinguishes two 
main approaches: the local- as-view (LAV) and the global- as-view (GAV) approach (see [TT1[25] for a compari- 
son). In the LAV approach the contents of the sources are defined as views over the mediator's schema, while 
in the GAV approach the mediator's virtual contents are defined as views of the contents of the sources. The 

^In our framework, domain relations correspond to the identity relation. 
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former approach offers flexibility in representing the contents of the sources, but query answering is "hard" 
because this requires answering queries using views ([Mj [24j |37]). On the other hand, the GAV approach 
offers easy query answering (expansion of queries until getting to source relations), but the addition/deletion 
of a source implies updating the mediator view, i.e. the definition of the mediator relations. In our case,
if the articulations contain only relationships between single terms, we have the benefits of both the GAV and
LAV approaches: (a) the query processing simplicity of the GAV approach, as query processing basically
reduces to unfolding the query using the definitions specified in the mapping, so as to translate the query into
accesses (i.e. queries) to the sources, and (b) the modeling scalability of the LAV approach, i.e. the
addition of a new underlying source does not require changing the previous mappings. Term-to-query
articulations, on the other hand, resemble the GAV approach.
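
The unfolding argument for term-to-term articulations can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: mappings are held in a plain dictionary from a term to the items it subsumes, an item is either a local term (a string) or a pair (source, term) standing for an access to a source, and the function name unfold and the source terms Fowl, Pinguino and Vogel are hypothetical.

```python
def unfold(term, mappings, seen=None):
    """Expand a term query into the set of (source, term) accesses it needs."""
    seen = set() if seen is None else seen
    if term in seen:          # guard against cyclic mappings
        return set()
    seen.add(term)
    accesses = set()
    for item in mappings.get(term, ()):
        if isinstance(item, tuple):      # (source, source_term): a source access
            accesses.add(item)
        else:                            # a narrower local term: keep unfolding
            accesses |= unfold(item, mappings, seen)
    return accesses


mappings = {
    "Bird": ["Penguin", ("S2", "Fowl")],
    "Penguin": [("S3", "Pinguino")],
}
print(sorted(unfold("Bird", mappings)))
# [('S2', 'Fowl'), ('S3', 'Pinguino')]

# Adding a source only appends a new mapping; existing entries are untouched.
mappings["Bird"].append(("S4", "Vogel"))
print(sorted(unfold("Bird", mappings)))
# [('S2', 'Fowl'), ('S3', 'Pinguino'), ('S4', 'Vogel')]
```

The first call shows the GAV-style simplicity of query processing by expansion; the appended mapping shows the LAV-style property that a new source leaves earlier mappings unchanged.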

7 Conclusions 

In this paper, we have presented a model of networked information sources, each based on a different 
terminology and endowed with a taxonomy over that terminology, and logically connected to the other 
sources via articulations between respective concepts. The model is very flexible, in that it allows an
articulation to connect a source to several other sources, by letting the terms in the tail of an articulation
be drawn from several terminologies.

For this kind of system, we have presented a query evaluation procedure, rooted in an algorithm for
testing the satisfiability of a set of propositional Horn clauses. Five architectures for implementing this
procedure in a distributed setting have been considered, stemming from two orthogonal criteria: direct
evaluation vs. query re-writing, and centralized vs. distributed allocation of data or taxonomy. For each of the resulting five architectures,
an implementation has been described, in terms of processes, communicating asynchronously through several 
queues. The design of the architectures and of the underlying communication scheme has emphasized 
efficiency and scalability. 
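
As a reminder of the kind of test the procedure is rooted in, the sketch below checks the satisfiability of propositional Horn clauses by naive forward chaining. It is a generic textbook formulation under assumed conventions, not the hypergraph-based algorithm of this paper: a clause is a pair (body, head), where body is a set of atoms and head is an atom, or None for a clause with no positive literal. The function name horn_satisfiable is hypothetical.

```python
def horn_satisfiable(clauses):
    """Return (satisfiable, derived_atoms) for a list of (body, head) clauses."""
    derived = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if body <= derived:            # every atom of the body is derived
                if head is None:           # a headless clause is violated
                    return False, derived
                if head not in derived:
                    derived.add(head)
                    changed = True
    return True, derived


clauses = [
    (set(), "a"),            # fact: a
    ({"a"}, "b"),            # a -> b
    ({"a", "b"}, "c"),       # a and b -> c
    ({"c", "d"}, None),      # not (c and d)
]
print(horn_satisfiable(clauses)[0])                     # True
print(horn_satisfiable(clauses + [(set(), "d")])[0])    # False
```

This loop rescans all clauses on every pass and is therefore quadratic in the worst case; linear-time variants keep a counter of underived body atoms per clause.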

The five implementations have been evaluated from a performance point of view, via simulations. A
ranking has resulted from this evaluation, in which direct evaluation over a client-server architecture
(named CCD) outperforms the others, followed by the architecture in which the taxonomy is centralized
and the queries are re-written before evaluation (named CDR).

Each implementation consists of several processes, each one implementing a complex algorithm. All these 
algorithms have been specified as UML state machines, in order to be simulated. For reasons of space, it has 
not been possible to give a complete account of all this work. In particular, more emphasis has been given 
to the implementations that perform best, CCD and CDR. However, the complete set of UML
state machines and the code of the simulator are available upon request.

We believe that the work presented in the paper provides conclusive knowledge on the addressed problem,
and can be used to derive an engineered system to be put to work on real-world applications.